Abstract

An  important  way  to  learn  new  actions  and  behaviors  is  by  observing  others,  and  several  evolutionary 
games  have  been  developed  to  investigate  what  learning  strategies  work  best  and  how  they  might  have 
evolved.  In  this  paper  we  present  an  extensive  set  of  mathematical  and  simulation  results  for  Cultaptation, 
which  is  one  of  the  best-known  such  games. 

We  derive  a  formula  for  measuring  a  strategy’s  expected  reproductive  success,  provide  algorithms 
to  compute  near-best-response  strategies  and  near-Nash  equilibria,  and  provide  techniques  for  efficient 
implementation  of  those  algorithms.  Our  experimental  studies  provide  strong  evidence  for  the  following 
hypotheses: 

1. The best strategies for Cultaptation and similar games are likely to be conditional ones in which the
choice of action at each round is conditioned on the agent’s accumulated experience. Such strategies
(or close approximations of them) can be computed by doing a lookahead search that predicts
how each possible choice of action at the current round is likely to affect future performance.

2.  Such  strategies  are  likely  to  exploit  most  of  the  time,  but  will  have  ways  of  quickly  detecting 
structural  shocks,  so  that  they  can  switch  quickly  to  innovation  in  order  to  learn  how  to  respond  to 
such  shocks.  This  conflicts  with  the  conventional  wisdom  that  successful  social-learning  strategies 
are  characterized  by  a  high  frequency  of  innovation;  and  agrees  with  recent  experiments  by  others 
on  human  subjects  that  also  challenge  the  conventional  wisdom. 
1 Introduction

An  important  way  to  learn  new  actions  and  behaviors  is  social  learning,  i.e.,  learning  by  observing  others. 
Some  social-learning  theorists  believe  this  is  how  most  human  behavior  is  learned  [1],  and  it  also  is  important 
for  many  other  animal  species  [11,  39,  33].  Such  learning  usually  involves  evaluating  the  outcomes  of 
others’  actions,  rather  than  indiscriminate  copying  of  others’  behavior  [8,  22],  but  much  is  unknown  about 
what  learning  strategies  work  best  and  how  they  might  have  evolved. 

For  example,  it  seems  natural  to  assume  that  communication  has  evolved  due  to  the  inherent  superiority 
of  copying  others’  success  rather  than  learning  on  one’s  own  via  trial-and-error  innovation.  However,  there 
has  also  been  substantial  work  questioning  this  intuition  [4,  25,  2,  28,  12]. 

Several  evolutionary  games  have  been  developed  to  investigate  social  learning  [29,  20,  27,  5].  One  of 
the  best-known  is  Cultaptation,  a  multi-agent  social-learning  game  developed  by  a  consortium  of  European 
scientists  [5]  who  sponsored  an  international  tournament  with  a  €10,000  prize.1  The  rules  of  Cultaptation 
are  rather  complicated  (see  Section  2),  but  can  be  summarized  as  follows: 

1NOTE: None of us is affiliated with the tournament or with the Cultaptation project.
• Each agent has three kinds of possible actions: innovation, observation, and exploitation. These
are highly simplified analogs of the following real-world activities, respectively: spending time and
resources  to  learn  something  new,  learning  something  by  communicating  with  another  agent,  and 
exploiting  the  learned  knowledge. 

•  At  each  step  of  the  game,  each  agent  must  choose  one  of  the  available  actions.  How  an  agent  does 
this  constitutes  the  agent’s  “social  learning  strategy.” 

•  Each  action  provides  an  immediate  numeric  payoff  and/or  information  about  the  payoffs  of  other 
actions  at  the  current  round  of  the  game.  This  information  is  not  necessarily  correct  in  subsequent 
rounds  because  the  actions'  payoffs  may  vary  from  one  round  to  the  next,  and  the  way  in  which  they 
may  vary  is  unknown  to  the  agents  in  the  game.2 

•  Each  agent  has  a  fixed  probability  of  dying  at  each  round.  At  each  round,  each  agent  may  also  produce 
offspring,  with  a  probability  that  depends  on  how  this  agent’s  average  per-round  payoff  compares  to 
the  average  per-round  payoffs  of  the  other  agents  in  the  game. 

A  second  Cultaptation  tournament  is  scheduled  to  begin  in  February  2012.  This  tournament  carries  a 
€25,000  prize  and  introduces  a  few  new  concepts  into  the  game,  such  as  the  ability  for  agents  to  improve 
actions  they  already  know,  and  proximity-based  observation.  This  paper  does  not  deal  with  these  additions, 
although  we  plan  to  address  them  in  future  work. 

Our  work  has  had  two  main  objectives:  (1)  to  study  the  nature  of  Cultaptation  to  see  what  types  of 
strategies are effective; and (2) more generally, to develop ways of analyzing evolutionary environments
with  social  learning.  Our  results  include  the  following: 

1.  Analyzing  strategies’  reproductive  success  (Section  6).  Given  a  Cultaptation  game  G  and  a  set  S 
of available strategies for G, we derive a formula for approximating (to within any ε > 0) the expected
per-round utility, EPRU(s | G, S), of each strategy s in S.3 We show that a strategy with maximal expected
per-round  utility  will  have  the  highest  expected  frequency  in  the  limit,  independent  of  the  initial  strategy 
profile.  These  results  provide  a  basis  for  evaluating  highly  complex  strategies  such  as  the  ones  described 
below. 

Generalizability:  These  results  can  be  generalized  to  other  evolutionary  games  in  which  agents  live  more 
than  one  generation,  with  a  fixed  probability  of  death  at  each  generation,  and  reproduction  is  done  using  the 
replicator  dynamic. 

2. Computing near-best-response strategies (Section 7). We provide a strategy-generation algorithm
that, given a Cultaptation game G and a set of available strategies S, can construct a strategy s_α that is within
ε of a best response to S.4

Generalizability: The strategy-generation algorithm performs a finite-horizon search, and is generalizable
to other evolutionary games in which there is a fixed upper bound on per-round utility and a nonzero
lower bound on the probability of death at each round.

2For our analyses, we assume the payoffs at each round are determined by an arbitrary function (which may be either
deterministic or probabilistic), and we analyze how strategies perform given various possible characteristics of that function. In general,
such  characteristics  would  not  be  known  to  any  Cultaptation  agent — but  our  objective  is  to  examine  the  properties  of  strategies  in 
various  versions  of  Cultaptation,  not  to  develop  a  Cultaptation  agent  per  se. 

3Because of how death and mutation work in Cultaptation, it follows that EPRU(s | G, S) is the same for every initial strategy
profile composed of strategies in S. In particular, it is the same regardless of how many agents are using each strategy when the
game  begins. 

4More precisely, s_α is an ε-best response to any initial strategy profile composed of strategies in S.
3. Approximating symmetric Nash equilibria (Section 8). We provide CSLA, an iterative
self-improvement algorithm that uses the strategy-generation algorithm in Section 7 to produce a strategy s_self
that is a near-best response in a Cultaptation game in which the other players are all using s_self. Hence a
strategy profile composed entirely of instances of s_self is a symmetric near-Nash equilibrium.

Generalizability: An iterative self-improvement algorithm similar to CSLA should be able to approximate
a Nash equilibrium for any game in which the strategies are complex enough that computing a best (or
near-best)  response  is  not  feasible  by  analyzing  the  strategies  directly,  but  is  feasible  using  information  from 
a  simulated  game  between  strategies  in  the  profile.  Games  of  this  type  will  typically  have  a  high  branching 
factor  but  relatively  simple  interactions  between  agents. 

4.  State  aggregation  (Section  7.5).  To  make  our  algorithms  fast  enough  for  practical  experimentation, 
we  provide  a  state-aggregation  technique  that  speeds  them  up  by  an  exponential  factor  without  any  loss  in 
accuracy.  Our  experimental  results  in  Section  9  demonstrate  the  practical  feasibility  that  this  provides:  in 
our  experiments,  CSLA  always  converged  in  just  a  few  iterations. 

Generalizability:  The  state-aggregation  technique  is  generalizable  to  other  evolutionary  games  in  which 
the utilities are Markovian.

5.  Experimental  results  (Section  9).  In  our  experimental  studies,  the  near-Nash  equilibria  produced  by 
CSLA in any given game were all virtually identical, regardless of the starting values that we used. That
strongly suggests (though it does not prove) that the strategy profile consisting of copies of s_self approximates
an  optimal  Nash  equilibrium,  and  possibly  even  a  unique  Nash  equilibrium. 

Consequently, s_self’s characteristics provide insights into the characteristics of good Cultaptation strategies.
For example, our experiments show that s_self exploits most of the time, but switches quickly to innovation
when a structural shock occurs, switching back to exploitation once it has learned how to respond to
the  shock.  This  conflicts  with  the  conventional  wisdom  [35,  4]  that  successful  social-learning  strategies  are 
characterized  by  a  high  frequency  of  innovation,  but  it  helps  to  explain  both  the  results  of  the  Cultaptation 
tournament  [34]  and  some  recent  experimental  results  on  human  subjects  [38]. 

6.  Implications.  Our  results  provide  strong  support  for  the  following  hypotheses  about  the  best  strategies 
for  Cultaptation  and  similar  games: 

• What they are like, and how they can be computed. The best strategies are likely to be conditional
ones  in  which  the  choice  of  action  at  each  round  is  conditioned  on  the  agent’s  accumulated  experience. 
Such  strategies  (or  close  approximations  of  them)  can  be  computed  by  doing  a  lookahead  search  that 
predicts  how  each  possible  choice  of  action  at  the  current  round  is  likely  to  affect  future  performance. 

•  How  they  are  likely  to  behave.  It  is  likely  that  the  best  strategies  will  exploit  most  of  the  time,  but 
will  have  ways  of  quickly  detecting  structural  shocks,  so  that  they  can  switch  quickly  to  innovation  in 
order  to  learn  how  to  respond  to  such  shocks. 

2  Cultaptation  Social-Learning  Game 

This  section  gives  a  more  detailed  description  of  the  Cultaptation  social  learning  game,  adapted  from  the 
official  description  [5].  The  game  is  a  multi-agent  round-based  game,  where  one  action  is  chosen  by  each 
agent each round. There are N agents playing the game, where N is a parameter to the game. No agent
knows of any other agent’s actions at any point in the game except through the Obs action specified below.
The actions available to each agent are innovation (Inv), observation (Obs), and exploitation (X_1, ..., X_μ,
where μ is a parameter to the game). Each Inv and Obs action informs the agent what the utility would be
for one of the exploitation actions, and an agent may not use an exploitation action X_i unless the agent has
previously learned of it through an innovation or observation action. Here are some details:

Exploitation. Each exploitation action X_i provides utility specific to that action (e.g., X_1 may provide
utility 10 and X_2 may provide utility 50). The utility assigned to each action at the beginning of the game is
drawn from a probability distribution π, where π is a parameter to the game.

The utility provided by each exploitation action X_i may change on round r, according to a probability
c_r. The function c is a parameter to the game, and specifies the probability of change for every round of the
game. When the changes occur, they are invisible to the agents playing the game until the agent interacts
with the changed action. For instance, if an action’s utility happens to change on the same round it is
exploited, the agent receives the new utility, and discovers the change when the new utility is received. The
new utility for a changed action is determined via the distribution π.

Innovation.  When  an  agent  uses  the  Inv  action,  it  provides  no  utility,  but  it  tells  the  agent  the  name  and 
utility of some exploitation action X_i that is chosen uniformly at random from the set of all exploitation
actions  about  which  the  agent  has  no  information.  If  an  agent  already  knows  all  of  the  exploitation  actions, 
then  Inv  is  illegal,  and  indeed  undesirable  (when  there  is  nothing  left  to  innovate,  why  innovate?).  The  agent 
receives  no  utility  on  any  round  where  she  chooses  an  Inv  action. 

Observation.  By  performing  an  Obs  action,  an  agent  gets  to  observe  the  action  performed  and  utility 
received  by  some  other  agent  who  performed  an  exploitation  action  on  the  previous  round.  Agents  receive 
no  utility  for  Obs  actions,  nor  any  information  other  than  the  action  observed  and  its  value:  the  agent  being 
observed,  for  instance,  is  unknown.  If  none  of  the  other  agents  performed  an  exploitation  action  on  the 
previous round, then there were no X_i actions to observe so the observing agent receives no information. In
some variants of the social learning game, agents receive information about more than one action when observing.
We do not treat such variants directly in this paper, but it is straightforward to extend our algorithms
to  take  this  difference  into  account. 
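The Inv and Obs percepts described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tournament implementation: the function names `innovate` and `observe`, the dictionary of current utilities, and the use of `None` for the null percept are hypothetical choices made here.

```python
import random

def innovate(utilities, known, rng):
    """Inv: reveal one exploitation action the agent has no information about.

    utilities: dict mapping action name -> current utility (hypothetical layout).
    known: set of action names the agent already has information about.
    Returns (action, utility); raises ValueError when Inv is illegal.
    """
    unknown = sorted(a for a in utilities if a not in known)
    if not unknown:
        raise ValueError("Inv is illegal: every exploitation action is known")
    action = rng.choice(unknown)  # uniform over the unknown actions
    return action, utilities[action]

def observe(previous_exploits, rng):
    """Obs: report one exploitation performed by another agent last round.

    previous_exploits: list of (action, utility) pairs from the previous round,
    excluding the observer's own. Returns (None, 0) if there is nothing to observe.
    """
    if not previous_exploits:
        return None, 0
    return rng.choice(previous_exploits)  # uniform over last round's exploiters
```

Neither function returns any payoff to the caller, which mirrors the rule that Inv and Obs earn no utility on the round in which they are used.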

Example 1 Consider two strategies: the innovate-once strategy (hereafter I1), which innovates
exactly once and exploits that innovated action (whatever it is) for the rest of the game, and the
innovate-twice-observe-once strategy (hereafter I2O), which innovates twice, observes once,
and exploits the highest valued action of the actions discovered for the rest of the game. For
simplicity of exposition, suppose there are only four exploitation actions: X_1, X_2, X_3, and X_4.

The values for each of these actions are drawn from a distribution; in this example we will
assume that they are chosen to be 3, 5, 8, and 5, respectively. For simplicity, we will assume
the probability of change is 0. Suppose there are two agents: one I1 and one I2O. For the first
action, I1 will innovate, which we suppose gives I1 the value of action X_1. On every subsequent
action, I1 will choose action X_1, exploiting the initial investment. If the agent dies k rounds
later, then the history of actions and utilities will be that given in Table 1, giving a utility of
3(k - 1) and a per-round utility of 3(k - 1)/k.

In contrast, I2O will innovate, informing it of the utility of X_3: 8, then it will innovate again,
informing it of the utility of X_4: 5, and finally it will observe. On the second round, I1 performed
X_1, and since these are the only two agents, this was the only exploitation action performed.
Therefore, I2O’s observation action on the next round must report that another agent got a
utility of 3 from action X_1 last round (if there were multiple possibilities, one would be chosen
uniformly at random). On round 4, I2O then knows that actions X_1, X_3, and X_4 have utilities
Round #         1     2     3     4     5     ...   k
I1’s action     Inv   X_1   X_1   X_1   X_1   ...   X_1
I1’s utility    0     3     6     9     12    ...   3(k - 1)
Per round       0     1.5   2     2.25  2.4   ...   3(k - 1)/k
I2O’s action    Inv   Inv   Obs   X_3   X_3   ...   X_3
I2O’s utility   0     0     0     8     16    ...   8(k - 3)
Per round       0     0     0     2     3.2   ...   8(k - 3)/k

Table 1: Action sequences from Example 1, and their utilities.


of 3, 8, and 5, respectively. Since the probability of change is 0, the obvious best action is X_3,
which I2O performs for the rest of her life. The utility of I2O on round k is 8(k - 3), making the
per-round utility 8(k - 3)/k. Note that on rounds 2 to 4, I2O will have a worse per-round utility than
I1, while after round 4, the utility of I2O will be higher (this is important because reproduction
is tied to per-round utility, as we will show shortly).
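The cumulative and per-round utilities in Example 1 can be recomputed directly from the payoff each agent receives on each round. The following is a small sketch; the helper name `running_utilities` and the payoff lists are illustrative assumptions built from the action values given in the example (3 for the innovate-once agent's action, 8 for the innovate-twice-observe-once agent's action, probability of change 0).

```python
def running_utilities(payoffs):
    """Return (cumulative, per_round) lists for a sequence of per-round payoffs."""
    cumulative, per_round, total = [], [], 0
    for round_number, payoff in enumerate(payoffs, start=1):
        total += payoff
        cumulative.append(total)
        per_round.append(total / round_number)
    return cumulative, per_round

# Payoffs received on rounds 1..5 in Example 1.
i1_payoffs = [0, 3, 3, 3, 3]    # Inv, then exploit the value-3 action
i2o_payoffs = [0, 0, 0, 8, 8]   # Inv, Inv, Obs, then exploit the value-8 action

i1_total, i1_rate = running_utilities(i1_payoffs)
i2o_total, i2o_rate = running_utilities(i2o_payoffs)

for r in range(5):
    print(r + 1, i1_total[r], i1_rate[r], i2o_total[r], i2o_rate[r])
```

The per-round columns cross between rounds 4 and 5 (2.25 versus 2 on round 4, 2.4 versus 3.2 on round 5), which is the crossover the example points out.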

Formally, everything that an agent α knows about each round can be described by an action-percept pair,
(a, (m, v)), where a ∈ {Inv, Obs, X_1, ..., X_μ} is the action that α chose to perform, and (m, v) is the percept
returned by the action. More specifically, m ∈ {X_1, ..., X_μ, ∅} is either an exploitation action or a null value,
and v is the utility observed or received. While a is chosen by the agent, m and v are percepts the agent
receives in response to that choice. If a is Inv or Obs, then v is the utility of exploitation action m. If a
is Obs and no agent performed an exploitation action last round, then there is no exploitation action to be
observed, hence m = ∅ and v = 0. If a is some X_i, then m will be the same X_i and v will be the utility
the agent receives for that action. The agent history for agent α is a sequence of such action-percept pairs,
h_α = ⟨(a_1, (m_1, v_1)), ..., (a_k, (m_k, v_k))⟩. As a special case, the empty (initial) history is ⟨⟩.

Example 2 The history for I2O in Example 1 is:

h_I2O = ⟨(Inv, (X_3, 8)), (Inv, (X_4, 5)), (Obs, (X_1, 3)), (X_3, (X_3, 8)), ...⟩

To concatenate a new action-percept pair onto the end of a history, we use the ∘ symbol. For example,
h_α ∘ (a, (m, v)) is the history h_α concatenated with the action-percept pair (a, (m, v)). Further, for
h_α = ⟨p_1, p_2, ..., p_k⟩, where each p_i is some action-percept pair, we let h_α[i] = p_i and h_α[i, ..., j] be the
subhistory ⟨p_i, ..., p_j⟩.
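One possible concrete representation of histories, offered as a sketch rather than as the authors' data structure, stores each action-percept pair as a nested tuple and a history as a tuple of such pairs. The helper names `concat`, `element`, and `subhistory` are hypothetical; they mirror concatenation, indexing, and the subhistory operation, using the same 1-based indexing as the text.

```python
# An action-percept pair is (a, (m, v)); a history is a tuple of such pairs.
EMPTY_HISTORY = ()

def concat(history, pair):
    """History concatenation: return a new history with `pair` appended."""
    return history + (pair,)

def element(history, i):
    """The i-th action-percept pair, counting from 1 as in the text."""
    return history[i - 1]

def subhistory(history, i, j):
    """The pairs p_i through p_j inclusive (1-indexed)."""
    return history[i - 1:j]

# Build the first four entries of the history shown in Example 2.
h = EMPTY_HISTORY
for pair in [("Inv", ("X3", 8)), ("Inv", ("X4", 5)),
             ("Obs", ("X1", 3)), ("X3", ("X3", 8))]:
    h = concat(h, pair)
```

Tuples are immutable, so extending a history never alters the shorter history it was built from, which is convenient when a search procedure explores several continuations of the same prefix.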

Strategies.  The  Cultaptation  game  is  ultimately  a  competition  among  strategies.  Here,  a  strategy  is  a 
function from histories to the set of possible actions: s : h_α ↦ m, where h_α is a history of an agent using
s and m is Inv, Obs or X_i for some i. Since each strategy may depend on the entire history, the set of
possible  strategies  is  huge;5  but  any  particular  Cultaptation  game  is  a  competition  among  a  much  smaller 
set  of  strategies  S,  which  we  will  call  the  set  of  available  strategies.  For  example,  if  there  are  n  contestants, 
each  of  whom  chooses  a  strategy  to  enter  into  the  game,  then  in  this  case, 

S = {the strategies chosen by the contestants}. (1)

5The number of possible mixed strategies is, of course, infinite. But even if we consider only pure strategies, the number is quite
huge. We show in Appendix C that for a 10,000-round Cultaptation game of the type used in the Cultaptation tournament [34], a
loose lower bound on the number of pure strategies is 10094xl°" . In contrast, it has been estimated [37] that the total number of
atoms in the observable universe is only about 10^78 to 10^82.
Each strategy in S may be used by many different agents, and the strategy profile
may change many times as the game progresses. When an agent reproduces, it passes its strategy on to a
newly created agent, with the per-round utility of each agent determining its likelihood of reproduction. A
strategy’s success is measured by its average prevalence over the last quarter of the game [5].
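Because a strategy is a function from histories to actions, the innovate-once and innovate-twice-observe-once strategies of Example 1 can be written as ordinary Python functions. This is a minimal sketch under assumptions: histories are tuples of (a, (m, v)) pairs, `None` stands for the null percept, and the function names are made up for illustration.

```python
def innovate_once(history):
    """Innovate on the first round, then exploit whatever was learned."""
    if not history:
        return "Inv"
    _, (learned_action, _) = history[0]
    return learned_action

def innovate_twice_observe_once(history):
    """Inv, Inv, Obs, then exploit the best action seen in those three percepts."""
    if len(history) < 2:
        return "Inv"
    if len(history) == 2:
        return "Obs"
    percepts = [(m, v) for _, (m, v) in history[:3] if m is not None]
    if not percepts:
        return "Inv"  # assumption of this sketch: innovate again if nothing was learned
    best_action, _ = max(percepts, key=lambda mv: mv[1])
    return best_action

# Replaying the percepts of Example 1 reproduces the choice made on round 4.
h = (("Inv", ("X3", 8)), ("Inv", ("X4", 5)), ("Obs", ("X1", 3)))
next_action = innovate_twice_observe_once(h)
```

Both functions ignore everything after the first few rounds, which is what makes them simple; a strategy that is conditioned on accumulated experience would inspect the whole history.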

The  replication  dynamics  work  as  follows.  On  each  round,  each  agent  has  a  2%  chance  of  dying.  As 
such,  we  also  include  a  parameter  d  in  our  formulation  representing  the  probability  of  death  (d  defaults 
to  0.02).  Upon  death,  an  agent  is  removed  from  the  game  and  replaced  by  a  new  agent,  whose  strategy 
is  chosen  using  the  reproduction  and  mutation  mechanisms  described  below.  Mutation  happens  2%  of  the 
time,  and  reproduction  happens  98%  of  the  time. 

Reproduction. When reproduction occurs, the social learning strategy used by the newborn agent is chosen
from the strategies of agents currently alive with a probability proportional to their per-round utility (the
utility  gained  by  an  agent  divided  by  the  number  of  rounds  the  agent  has  lived).  The  agent  with  the  highest 
per-round  utility  is  thus  the  most  likely  to  propagate  its  strategy  on  reproduction.  We  now  give  an  example 
of  this. 

Example 3 Again looking at the sequences of actions in Table 1, we see that both agents
would have equal chance of reproducing on round 1. However, on round 2 I1 has a per-round
utility of 1.5, while I2O has a per-round utility of 0, meaning I1 gets 100% of the reproductions
occurring on round 2. Round 3 is the same, but on round 4 I1 has a per-round utility of
2.25 and I2O has a per-round utility of 2. This means that I1 gets 100 · 2.25/4.25 ≈ 53% of the
reproductions and I2O gets 100 · 2/4.25 ≈ 47% of the reproductions on round 4.
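The percentages in Example 3 follow from normalizing the agents' per-round utilities. The sketch below shows that calculation; the function name and the dictionary layout are illustrative assumptions, and the uniform fallback for the all-zero case is inferred from the example's remark that both agents have an equal chance on round 1.

```python
def reproduction_probabilities(per_round_utility):
    """Map agent name -> probability of supplying the newborn's strategy.

    Probabilities are proportional to per-round utility. When every living
    agent has per-round utility 0, this sketch assumes a uniform choice.
    """
    total = sum(per_round_utility.values())
    if total == 0:
        n = len(per_round_utility)
        return {name: 1 / n for name in per_round_utility}
    return {name: u / total for name, u in per_round_utility.items()}

round1 = reproduction_probabilities({"innovate-once": 0.0, "innovate-twice-observe-once": 0.0})
round2 = reproduction_probabilities({"innovate-once": 1.5, "innovate-twice-observe-once": 0.0})
round4 = reproduction_probabilities({"innovate-once": 2.25, "innovate-twice-observe-once": 2.0})
```

On round 4 the two probabilities are 2.25/4.25 and 2/4.25, which round to the 53% and 47% quoted in the example.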

Mutation. In Cultaptation, mutation does not refer to changes in an agent’s codebase (as in genetic programming).
Instead, it means that the new agent’s strategy s is chosen uniformly at random from the set of
available strategies, regardless of whether any agents used s on the previous round. For instance, if there
were a Cultaptation game pitting strategies I1 and I2O against one another, then a new mutated agent would
be equally likely to have either strategy I1 or I2O, even if there were no living agents with strategy I1.
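Putting reproduction and mutation together, the choice of a replacement agent's strategy can be sketched as follows. This is an illustration under assumptions, not the tournament code: the function name `newborn_strategy`, its argument layout, and the uniform fallback when no living agent has earned any utility are all hypothetical. The 2% mutation rate is the value stated above.

```python
import random

def newborn_strategy(living, available, rng, p_mutation=0.02):
    """Pick the strategy for an agent that replaces one that just died.

    living: list of (strategy_name, per_round_utility) for agents currently alive.
    available: list of all strategy names in the game (the set S).
    """
    if rng.random() < p_mutation:
        # Mutation: uniform over S, even for strategies with no living agents.
        return rng.choice(available)
    names = [name for name, _ in living]
    weights = [utility for _, utility in living]
    if sum(weights) == 0:
        # Assumption of this sketch: uniform choice when nobody has earned utility.
        return rng.choice(names)
    # Reproduction: probability proportional to per-round utility.
    return rng.choices(names, weights=weights, k=1)[0]
```

Setting `p_mutation=0.0` models the phases of a game in which mutation is disabled; in that case a strategy with no living agents can never reappear.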

Game  Types.  In  the  Cultaptation  tournament  [34],  two  types  of  games  were  played:  pairwise  games 
and  melee  games.  A  pairwise  game  was  played  with  an  invading  strategy  and  a  defending  strategy.  The 
defending  strategy  began  play  with  a  population  of  100  agents,  while  the  invading  strategy  began  with  none. 
Mutation  was  also  disabled  for  the  first  100  rounds,  to  allow  the  defending  strategy  time  to  begin  earning 
utility.  After  100  rounds,  mutation  was  enabled  and  the  invader  had  the  challenging  task  of  establishing  a 
foothold  in  a  population  consisting  entirely  of  agents  using  the  defending  strategy  (most  of  whom  would 
have  had  time  to  find  several  high-payoff  actions).  Since  the  pairwise  games  provide  a  clear  early-game 
advantage  to  the  defender,  they  were  typically  played  twice  with  the  invader  and  defender  swapping  roles 
on  the  second  game.  A  melee  game  was  played  with  n  strategies,  for  some  n  >  2.  Initially,  the  population  of 
100  agents  was  evenly  divided  between  each  strategy  in  the  game.  Mutation  was  disabled  for  the  last  quarter 
of  the  game,  so  that  it  would  not  influence  results  when  strategies  had  similar  fitness. 

Scoring. If we have k social learning strategies s_1, ..., s_k playing Cultaptation, then on any given round
there will be some number n_j of agents using strategy s_j, for 1 ≤ j ≤ k. Strategy s_j’s score for the game
is the average value of n_j over the final 2,500 rounds of the game. The strategy with the highest score is
declared the winner.
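The scoring rule is a plain average of population counts over the final rounds. A sketch of it, assuming the per-round counts are available as a list of dictionaries (a hypothetical layout chosen for this illustration, as are the function names):

```python
def strategy_scores(counts_per_round, window=2500):
    """Average number of agents per strategy over the final `window` rounds.

    counts_per_round: list with one dict per round, mapping strategy name -> count.
    """
    final_rounds = counts_per_round[-window:]
    names = set().union(*final_rounds)
    return {
        name: sum(counts.get(name, 0) for counts in final_rounds) / len(final_rounds)
        for name in names
    }

def winner(counts_per_round, window=2500):
    """Name of the strategy with the highest score."""
    scores = strategy_scores(counts_per_round, window)
    return max(scores, key=scores.get)
```

Because only the final window counts, a strategy that dominates the early game but is displaced later scores poorly, as in a 10,000-round game where one strategy holds all 100 agents for 7,500 rounds and then drops to 40.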

The only way an agent may affect n_j is through reproduction. We will show in Section 6.2 that any
strategy  maximizing  an  agent’s  expected  per-round  utility  (defined  in  Section  5.5)  will  also  maximize  its 
reproduction.  We  will  therefore  focus  on  computing  the  expected  per-round  utility. 
Figure 1: An example of a game in which there is a large structural shock. The columns for the exploitation
actions X_i show their values at each round, and the columns for agents A1-A3 show their histories. Note
that by round 6, all agents choose action X_4, which has changed to a very low value. Since none of the
agents are innovating, none of them can find the newly optimal action X_3.

3  Motivating  Discussion 

The  purpose  of  this  section  is  to  explain  the  motivations  for  several  aspects  of  our  work: 

•  Sections  3.1  and  3.2  give  examples  of  types  of  strategies  that  seem  like  they  should  work  well  at 
first glance, but can have unexpectedly bad consequences. The existence of such situations motivates
the  algorithms  described  later  in  this  paper,  which  perform  a  game  tree  search  in  order  to  consider 
strategies’  long-term  consequences. 

•  An  important  way  of  getting  insight  into  a  game  is  to  examine  its  best-response  strategies;  and  this 
approach  is  at  the  heart  of  our  formal  analysis  and  our  game-tree  search  algorithms.  Section  3.3 
explains some issues that are important for finding best-response strategies in Cultaptation.

3.1  Innovation,  Observation,  and  Structural  Shocks 

If  we  want  to  acquire  a  new  action  to  exploit,  then  what  is  the  best  way  of  doing  it:  to  observe,  or  to  innovate? 
At first glance, observing might seem to be the best approach. If the other agents in the environment are
competent, then it is likely that they are exploiting actions that have high payoffs, hence we should be able
to  acquire  a  better  action  by  observing  them  than  by  innovating.  This  suggests  that  an  optimal  agent  will 
rely  heavily  on  observation  actions.  However,  the  following  example  shows  that  relying  only  on  observation 
actions  can  lead  to  disastrous  consequences  if  there  is  a  structural  shock,  i.e.,  a  large  change  in  the  value  of 
an  exploitation  action.6 

Example  4  (structural  shocks)  Figure  1  shows  a  Cultaptation  game  in  which  all  agents  use 
the following strategy: each agent begins with a single Obs action, followed by a single Inv
action if the Obs action returns ∅,7 in order to obtain an exploitation action X_i which the agent
will use in all subsequent rounds.

Agent A3 acquires action X_4 by doing an unsuccessful Obs followed by an Inv; and A1 and
A2 acquire X_4 by observing A3. At first, X_4 is far better than the other exploitation actions, so
all of the agents do well by using it. On round 8, the action X_4 changes to the lowest possible

6We  have  borrowed  this  term  from  the  Economics  literature,  where  it  has  an  analogous  meaning  (e.g.,  [10,  14]). 

7This  will  generally  only  happen  on  the  first  round  of  the  game,  before  any  agent  has  obtained  an  exploitation  action. 
value, but the agents continue to use it anyway. Furthermore, any time a new agent is born, it
will observe them using X_4 and will start using it too.

This is a pathological case where the best action has disappeared and the agents are in a sense “stuck”
exploiting the suboptimal result. Their only way out is if all agents die at once, so that one of the newly born
agents is forced to innovate. In Section 9.1.2, our experiments show that in some cases, situations like these
are a big enough risk that a near-best response strategy will choose innovation moves more frequently than
observation  moves. 

3.2  Innovation  and  Observation  Versus  Exploitation 

One  might  also  think  that  agents  should  perform  all  of  their  innovation  and  observation  actions  first,  so  that 
they  have  as  many  options  as  possible  when  choosing  an  action  to  exploit.  However,  as  Raboin  et  al.  [32] 
demonstrate,  this  intuition  is  not  always  correct.  Because  the  game  selects  which  agents  reproduce  based  on 
average per-round utility, not total accumulated utility, it is frequently better for a newborn agent to exploit
one of the first actions it encounters, even if this action has a mediocre payoff (e.g., exploiting an action
with  value  10  on  the  second  round  of  an  agent’s  life  gives  it  as  much  per-round  payoff  as  exploiting  an 
action  with  value  50  on  the  tenth  round).  Once  the  agent  has  at  least  some  per-round  utility  so  that  it  has  a 
nonzero  chance  of  reproducing,  it  can  then  begin  searching  for  a  high-valued  action  to  exploit  for  the  rest  of 
its  lifetime. 
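The arithmetic behind the parenthetical example is a one-line division. The sketch below assumes, for illustration only, that the agent earned nothing before the round on which it first exploits; the function name is hypothetical.

```python
def per_round_utility(total_utility, rounds_lived):
    """Average payoff per round lived, the quantity that reproduction depends on."""
    return total_utility / rounds_lived

# Assumption: no payoff was earned before the first exploitation.
early = per_round_utility(10, 2)    # value-10 action exploited on round 2
late = per_round_utility(50, 10)    # value-50 action exploited on round 10

# Rounds spent with zero per-round utility before the first exploitation.
early_zero_rounds = 2 - 1
late_zero_rounds = 10 - 1
```

Both agents reach the same per-round utility of 5, but the late exploiter spends nine rounds with no chance of reproducing, compared with one round for the early exploiter.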

3.3  Best-Response  Strategies  in  Cultaptation 

A widely used technique for getting insight about a game (e.g., see [26]) is to look at the game’s
best-response strategies. Given an agent α and a strategy profile (i.e., an assignment of strategies to agents) s_{-α}
for the agents other than α, α’s best response is a strategy s_opt that maximizes α’s expected utility if the other
agents use the strategies in s_{-α}.

In  Cultaptation,  it  is  more  useful  to  consider  a  best  response  to  the  set  of  available  strategies  S,  rather 
than  any  particular  strategy  profile.  During  the  course  of  a  Cultaptation  game,  the  strategy  profile  will 
change  many  times  as  agents  die  and  other  agents  are  born  to  take  their  places.  Each  strategy  in  S  will  be 
scored  based  on  its  average  performance  over  the  course  of  the  game;  and  we  can  show  (see  Section  6.2.1) 
that  given  S,  each  strategy’s  score  is  independent  of  the  initial  strategy  profile  if  the  game  is  sufficiently 
long. 

If G is a Cultaptation game (i.e., a set of values for game parameters such as the number of agents, set
of available actions, probability distribution over their payoffs; see Section 5 for details), then for any agent
$\alpha$, any set of available strategies S, and any history $h_\alpha$ for $\alpha$, there is a probability
distribution $\pi_{Obs}(a \mid h_\alpha, S)$ that gives the probability of observing each action a, given S and
$h_\alpha$. Given $\pi_{Obs}$ and G, we can calculate the probability of each possible outcome for each action
our agent might take, which will allow us to determine the best response to S. Computing $\pi_{Obs}$ is not
feasible in general, but it is possible to compute approximations of it in some special cases (e.g., cases in
which all of the agents, or all of the agents other than $\alpha$, use the same strategy). That is the approach
used in this paper.

4  Related  Work 

In this section we will discuss related work on social learning and on computational techniques related to
our own.


4.1  Social  Learning 

The  Cultaptation  social  learning  competition  offers  insight  into  open  questions  in  behavioral  and  cultural 
evolution.  An  analysis  of  the  competition  is  provided  by  Rendell  et  al.  [34].  Of  the  strategies  entered 
into  the  competition,  those  that  performed  the  best  were  those  that  greatly  favored  observation  actions  over 
innovation  actions,  and  the  top  performing  strategy  learned  almost  exclusively  through  observation.  This 
was  considered  surprising,  since  several  strong  arguments  have  previously  been  made  for  why  social  learning 
isn’t  purely  beneficial  [4,  35].  However,  this  result  is  consistent  with  observations  made  during  our  own 
experiments, in which the $\epsilon$-best-response strategy rarely did innovation (see Section 9).

In previous work, Carr et al. showed how to compute optimal strategies for a highly simplified version
of the Cultaptation social learning game [6]. Their paper simplifies the game by completely removing the
observation  action — which  prevents  the  agents  from  interacting  with  each  other  in  any  way  whatsoever, 
thereby  transforming  the  game  into  a  single-agent  game  rather  than  a  multi-agent  game.  Their  model  also 
assumes  that  exploitable  actions  cannot  change  value  once  they  have  been  learned,  which  overlooks  a  key 
part  of  the  full  social  learning  game. 

Wisdom and Goldstone attempted to study social learning strategies using a game similar to Cultaptation,
but using humans rather than computer agents [38]. Their game environment consisted of a group
of  “creatures,”  each  of  which  had  some  hidden  utility.  The  agents’  objective  was  to  select  a  subset  of  the 
creatures  to  create  a  “team,”  which  was  assigned  a  utility  based  on  the  creatures  used  to  create  it.  Agents 
had  a  series  of  rounds  in  which  to  modify  their  team,  and  on  each  round  they  were  allowed  to  see  the  teams 
chosen  by  other  agents  on  the  previous  round  (and  in  some  cases,  the  utility  of  the  other  agents'  teams),  and 
the  object  of  the  game  was  to  maximize  the  utility  of  one’s  team.  In  this  game,  the  acts  of  keeping  a  creature 
on  one’s  team,  choosing  a  creature  that  another  agent  has  used,  and  choosing  a  creature  no  one  has  yet  used 
correspond  to  exploitation,  observation,  and  innovation  (respectively)  in  the  Cultaptation  game. 

The  successful  strategies  Wisdom  and  Goldstone  saw  are  similar  to  those  used  by  the  strategies  found 
by  our  algorithm:  they  keep  most  of  the  creatures  on  their  team  the  same  from  round  to  round  (which 
corresponds in Cultaptation to performing mostly exploitation actions), and new creatures are mostly drawn
from  other  agents’  teams  (which  corresponds  to  preferring  observation  over  innovation  in  Cultaptation). 
However,  Wisdom  and  Goldstone  highlight  these  characteristics  as  interesting  because  they  run  contrary  to 
the  conventional  wisdom  for  social  learning  strategies,  which  suggests  that  broader  exploration  should  lead 
to  better  performance,  and  therefore  that  successful  strategies  should  innovate  more  often  [35].  In  this  case, 
analyzing  the  strategies  found  by  our  algorithm  allowed  us  to  draw  the  same  conclusions  about  what  works 
well.  This  gives  more  evidence  that  the  conventional  wisdom  on  social  learning  [4,  35]  may  be  mistaken. 

How  best  to  learn  in  a  social  environment  is  still  considered  a  nontrivial  problem.  Barnard  and  Sibly 
show that if a large portion of the population is learning only socially, and there are few information
producers, then the utility of social learning goes down [2]. Thus, indiscriminate observation is not always the
best  strategy,  and  there  are  indeed  situations  where  innovation  is  appropriate.  Authors  such  as  Laland  have 
attempted to produce simple models for determining when one choice is preferable to the other [25].
Game-theoretic approaches have also been used to explore this subject, but it is still ongoing research [15, 9].
Giraldeau et al. offer reasons why social information can become unreliable. Both biological factors, and
the  limitations  of  observation,  can  significantly  degrade  the  quality  of  information  learned  socially  [12]. 

Work  by  Nettle  outlines  the  circumstances  in  which  verbal  communication  is  evolutionarily  adaptive, 
and  why  few  species  have  developed  the  ability  to  use  language  despite  its  apparent  advantages  [28].  Nettle 
uses  a  significantly  simpler  model  than  the  Cultaptation  game,  but  provides  insight  that  may  be  useful  to 
understanding  social  learning  in  general.  In  Nettle’s  model,  the  population  reaches  an  equilibrium  at  a  point 
where  both  individual  and  social  learning  occur.  The  point  of  equilibrium  is  affected  by  the  quality  of 
observed  information  and  the  rate  of  change  of  the  environment. 

Symbol                                   Meaning
N                                        Number of agents in the environment.
S                                        The set of available strategies. Agents may only use strategies in S.
r                                        Number of the current round, ranging from 1 to $\infty$.
c                                        The probability of change on all rounds.
d                                        The probability of death on all rounds.
$(a, (m, v))$                            An action-percept pair in which the action a returns the percept (m, v).
$h_\alpha$                               Agent history for $\alpha$. A sequence of action-percept pairs experienced by agent $\alpha$.
$h_\alpha[i]$                            The i-th action-percept pair in $h_\alpha$.
$X(h_\alpha)$                            Number of exploitable actions given history $h_\alpha$.
$\mu$                                    Number of exploitation actions in the game.
$\pi$                                    Probability distribution for the new value of any action whose value changed at round r.
$\pi_{Obs}(m, v \mid h_\alpha, S)$       Probability that Obs will observe action m with value v at history $h_\alpha$.
$\pi_{Inv}(v \mid r)$                    Probability that Inv will return an action with value v on round r.
V                                        The set of potential utility values.
$P(h'_\alpha \mid h_\alpha, a, S)$       Probability of transitioning to history $h'_\alpha$ if $\alpha$ performs action a at history $h_\alpha$.
$L(|h_\alpha|)$                          Probability that $\alpha$ lives long enough to experience history $h_\alpha$.
T                                        Set of all action-percept pairs of the form (a, (m, v)).

Table 2: A glossary of notation used in this paper.


4.2  Related  Computational  Techniques 

The  restless  bandit  problem,  a  generalization  of  the  stochastic  multi-armed  bandit  problem  that  accounts  for 
probability of change, is cited as the basis for the rules of the Cultaptation tournament [34]. The rules of the
Cultaptation  game  differ  from  the  restless  bandit  problem  by  including  other  agents,  making  observation 
actions  possible  and  complicating  the  game  significantly.  We  also  show  in  Section  6.2  that  maximizing 
total  payoff,  the  goal  of  the  restless  bandit  problem,  is  different  from  maximizing  expected  per-round  utility 
(EPRU)  of  an  agent  in  the  Cultaptation  tournament. 

The restless bandit problem is known to be PSPACE-complete, meaning it is difficult to compute optimal
solutions for in practice [30, 13]. Multi-armed bandit problems have previously been used to study the
tradeoff between exploitation and exploration in learning environments [36, 24].

As  discussed  later  in  Section  5,  finding  a  best-response  strategy  in  Cultaptation  is  basically  equivalent 
to  finding  an  optimal  policy  for  a  Markov  Decision  Process.  Consequently,  our  algorithm  for  finding  near- 
best-response  strategies  has  several  similarities  to  the  approach  used  by  Kearns  et  al.  to  find  near-optimal 
policies  for  large  MDPs  [21].  Both  algorithms  use  the  discount  factor  of  the  MDP  (which,  in  our  case,  is 
the probability of death d) and the desired accuracy $\epsilon$ to create a horizon for their search, and the depth
ha  of  this  horizon  depends  on  the  discount  factor  and  the  branching  factor,  but  not  on  the  size  of  the  full 
state  space  (unlike  conventional  MDP  algorithms).  Thus,  both  their  algorithm  and  ours  also  have  running 
time exponential in $1/\epsilon$ and in the branching factor. However, the algorithm provided by Kearns et al. was
designed  as  an  online  algorithm,  so  it  only  returns  the  near-optimal  action  for  the  state  at  the  root  of  the 
search  tree.  Ours,  on  the  other  hand,  returns  a  strategy  specifying  which  action  the  agent  should  take  for  all 
states that can occur on the first ha rounds. This means that our exponential-time algorithm only needs to
run  once  to  generate  an  entire  strategy,  rather  than  once  per  agent  per  round  in  each  game  we  simulate. 

Many algorithms for optimal control of an MDP have been developed; however, they all have running
time that grows linearly with the size of the state space of the MDP. This makes them intractable for problems
like ours, which have exponentially large state spaces. Several approaches for near-optimal control, which
produce a policy within some $\epsilon$ of optimal, have been developed [21, 23, 3].
5  Formal  Model


In  this  section  we  introduce  a  formal  mathematical  model  of  Cultaptation  games.  A  glossary  of  the  notation 
used  in  this  paper  is  provided  as  Table  2. 

Game Definition. Cultaptation requires a number of parameters to determine exactly how it will run.
Therefore, in our formal model, we will define the game parameters, G, to be a set of values for the
following: N, the number of agents; $\mu$, the number of exploitation actions in the game; c, the probability
that an exploitation action changes its utility each round; $\pi$, the probability distribution used to assign a
utility value to each exploitation action, both at the outset and each time an action's utility changes; and d,
the probability of death. In the Cultaptation tournament, only the values of N, $\mu$, and d were known ahead
of time, but for our analysis we use the values of the other parameters as well.

Recall  that  in  the  Cultaptation  tournament  [5],  each  evolutionary  simulation  was  a  contest  between  two 
or more strategies submitted to the tournament. Thus, there is a fixed set of strategies that are allowed to
occur in a given simulation. We will call this the set of available strategies S, where
$S = \{s_1, s_2, \dots, s_\ell\}$ for some finite $\ell$ (i.e., in pairwise games $\ell = 2$, in melee games
$\ell > 2$). Any strategy profile s that occurs in the
simulation  will  consist  only  of  strategies  in  S.  When  an  agent  is  chosen  to  be  replaced  via  mutation,  its  new 
strategy  is  selected  at  random  from  the  strategies  in  S. 

We can now define a Cultaptation game formally, as follows. A Cultaptation game is an $\ell$-player game,
in which each player receives the game parameters G as input. Each player then simultaneously chooses a
strategy to put into the set of available strategies. We will call player i's strategy $s_i$, so that
$S = \{s_1, s_2, \dots, s_\ell\}$. The pair (G, S) is an instance of G. In (G, S), each player i will receive a
payoff equal to $\mathrm{score}(s_i)$, defined below.

Scoring. The version of Cultaptation used in the tournament continued for 10,000 rounds, and each strategy
was assigned a score equal to its average population over the last 2,500 rounds. But as is often done
in analyses of repeated games, our formal model assumes an infinite Cultaptation game, i.e., the game
continues for an infinite number of rounds, and the score for strategy s is its average population over the
entire game:

$$\mathrm{score}(s) = \lim_{r \to \infty} \frac{\sum_{j=1}^{r} p(s, j)}{r},$$

where $p(s, j)$ is the population size of agents using strategy s on round j. This greatly simplifies our analysis
in Section 5.5, by allowing us to average out the various sources of noise present in the game.

Actions. The rest of the formal model will be constructed from the perspective of an arbitrary agent, $\alpha$, in
a given infinite Cultaptation game instance (G, S). We use r for the number of a round, and $X(h_\alpha)$ to
specify the number of exploitation actions available after history $h_\alpha$. After all exploitation actions
$X_1, \dots, X_\mu$ have been innovated or observed in a history $h_\alpha$, then $X(h_\alpha) = \mu$ and
innovation actions become illegal.

We model the payoffs supplied for exploitation actions $X_i$ by a probability distribution $\pi$. $\pi(v)$ is the
probability of an action having payoff v at the start of the game instance. $\pi(v)$ is also the probability that,
when an action changes its payoff, the new payoff is v. We let V be the set of all action values that may
occur with non-zero probability:

$$V = \{v \mid \pi(v) > 0\}.$$

We  require  the  set  V  to  be  finite. 
If we let $\pi_{Inv}(v \mid r)$ be the probability that value v is innovated on round r, it can be defined
recursively in terms of c and $\pi$ as:

$$\pi_{Inv}(v \mid r) = \begin{cases} \pi(v), & \text{if } r = 0, \\ c\,\pi(v) + (1 - c)\,\pi_{Inv}(v \mid r - 1), & \text{otherwise.} \end{cases}$$


That is, initially the chance that Inv will return an action with value v is determined by the given
distribution $\pi(v)$. On later rounds (r > 0) the chance that Inv will return an action with value v is the chance
that an action's value changed to v on the current round (given by $c\,\pi(v)$), plus the chance that an action's
value was v on the previous round and it did not change this round.
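
The recursion for the innovation distribution can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `pi_inv`, the dictionary representation of the payoff distribution, and the sample values are assumptions made for the example, not part of the formal model.

```python
def pi_inv(pi, c, r):
    """Distribution over innovated values on round r.

    pi: dict mapping each value v to its probability pi(v).
    c:  probability that an action changes its value on a round.
    """
    dist = dict(pi)  # base case, r = 0
    for _ in range(r):
        # changed to v this round, or was v last round and did not change
        dist = {v: c * pi[v] + (1 - c) * dist[v] for v in pi}
    return dist


# Hypothetical payoff distribution, used only for illustration.
pi = {10: 0.5, 50: 0.3, 100: 0.2}
print(pi_inv(pi, c=0.1, r=25))
```

Because the base case is the payoff distribution itself, each step of the recursion as written returns that same distribution, so in this sketch the result does not depend on r.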

While computing the probability distribution for utilities of actions returned by Inv was fairly
straightforward, computing a similar distribution for Obs actions is significantly more difficult. Let $\alpha$ be
any agent, and S be the set of available strategies. From S we can get a probability distribution over the other
agents' actions in any given situation; and from this we can derive $\pi_{Obs}(m, v \mid h_\alpha, S)$, the
probability that Obs would return the action-percept pair (m, v), given history $h_\alpha$.

In order to derive $\pi_{Obs}$, we must consider each possible strategy profile $s_{-\alpha}$ for agents besides
$\alpha$, determine how likely that strategy profile is to occur, and then determine what each agent in
$s_{-\alpha}$ will do for every possible sequence of actions they could have encountered, bounded only by the
percepts our agent has received in $h_\alpha$. As we discussed in Section 3.3, the number of possible histories
alone is astronomically large. Since $\pi_{Obs}$ is conditioned on each possible history it will be larger still, so
in any practical implementation the best we can do is to approximate $\pi_{Obs}$ (Section 7.5 describes how we
will do this). But for our theoretical development, we will assume we have an oracle for $\pi_{Obs}$ that will
tell us exactly how likely we are to observe any given action-utility pair.

In what follows, we will show that, given $\pi$, $\pi_{Obs}$, V, and S, we can calculate the possible outcomes of
each  action  the  agent  may  take,  and  the  probability  of  each  of  these  outcomes.  This  allows  us  to  treat  an 
infinite  Cultaptation  game  as  a  Markov  Decision  Process  (MDP)  [18].  Calculating  the  best  response  in  this 
case  is  equivalent  to  finding  an  optimal  control  policy  for  an  MDP. 


5.1  Transition  Probabilities 

A transition probability function $P(h'_\alpha \mid h_\alpha, a, S)$ defines the probability of transitioning from
history $h_\alpha$ to history $h'_\alpha = h_\alpha \circ (a, (m, v))$ in the next round if an agent $\alpha$ performs
action a. These transition probabilities are for the case where $\alpha$ does not die before reaching $h'_\alpha$;
we introduce functions to account for the probability of death in Section 5.2.

There are three cases for what $P(h'_\alpha \mid h_\alpha, a, S)$ might be, depending on whether a is an
innovation, observation, or exploitation action:

•  If a = Inv, then

$$P(h_\alpha \circ (\mathrm{Inv}, (m, v)) \mid h_\alpha, \mathrm{Inv}, S) = \begin{cases} \dfrac{\pi_{Inv}(v \mid r)}{\mu - X(h_\alpha)} & \text{if } h_\alpha \text{ contains no percepts that contain the action } m, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

•  Recall that an agent cannot innovate action m if it has already encountered m by innovating or
observing. Observation actions are not subject to the same restriction, so if a = Obs, then

$$P(h_\alpha \circ (\mathrm{Obs}, (m, v)) \mid h_\alpha, \mathrm{Obs}, S) = \pi_{Obs}(m, v \mid h_\alpha, S) \tag{3}$$

where $\pi_{Obs}(m, v \mid h_\alpha, S)$ models the exploitation behavior of the other agents in the environment.
Obviously, the exact probability distribution will depend on the composition of strategies used by these
agents. The above definition is general enough to support a wide range of environments; and in
Section 7.5 we will discuss one potential way to model this function for a more specific set of
environments.

•  Finally, if $a = X_m$, then $h_\alpha$ must contain at least one percept for $X_m$. Let r be the round at which
the last such percept occurred. For the case where $X_m$'s utility did not change since round r, we have

$$P(h_\alpha \circ (X_m, (m, v)) \mid h_\alpha, X_m, S) = \underbrace{(1 - c)^{|h_\alpha| - r}}_{\text{prob. of not changing}} + \underbrace{c\,\pi(v) \sum_{j=r}^{|h_\alpha| - 1} (1 - c)^{|h_\alpha| - j}}_{\text{prob. of changing back to } v} \tag{4}$$

For the case where $X_m$'s utility did change since round r, we have

$$P(h_\alpha \circ (X_m, (m, v)) \mid h_\alpha, X_m, S) = c\,\pi(v) \sum_{j=r}^{|h_\alpha|} (1 - c)^{|h_\alpha| - j} \tag{5}$$

which is similar, but assumes that the value must have changed at least once.

In all other cases, no transition from $h_\alpha$ to $h'_\alpha$ is possible, so $P(h'_\alpha \mid h_\alpha, a, S) = 0$.

5.1.1  Probability  of  Reaching  a  History 

We will frequently be interested in $P(h_\alpha \mid s, S)$, the probability of history $h_\alpha$ occurring given that
the agent is following some strategy $s \in S$. We will be able to derive $P(h_\alpha \mid s, S)$ iteratively,
calculating the probability of each step of history $h_\alpha$ occurring using the functions derived above.

Specifically, $P(h_\alpha \mid s, S)$ is the probability that each $h_\alpha[i] = (a_i, (m_i, v_i))$ occurs given the
action chosen by the strategy in the history
$h_\alpha[1, \dots, i-1] = (a_1, (m_1, v_1)) \circ \cdots \circ (a_{i-1}, (m_{i-1}, v_{i-1}))$, or:

$$P(h_\alpha \mid s, S) = \prod_{i=1}^{|h_\alpha| - 1} P\big(h_\alpha[1, \dots, i] \circ h_\alpha[i+1] \,\big|\, h_\alpha[1, \dots, i],\ s(h_\alpha[1, \dots, i]),\ S\big) \tag{6}$$

5.2  Accounting  for  Probability  of  Death 

The probability of an agent living long enough to experience history $h_\alpha$ depends on the probability of
death. It is

$$L(|h_\alpha|) = (1 - d)^{|h_\alpha| - 1}. \tag{7}$$

When we calculate the probability of reaching a given history $h_\alpha$, we will generally multiply it by
$L(|h_\alpha|)$ to account for the chance that the agent dies before reaching $h_\alpha$.

Sometimes we will also be interested in the probability that a randomly-selected agent has history $h_\alpha$.
For this we will need to know the probability that a randomly-selected agent is exactly $|h_\alpha|$ rounds old,
which is simply:

$$\frac{L(|h_\alpha|)}{\sum_{i=1}^{\infty} L(i)} = \frac{L(|h_\alpha|)}{\frac{1}{1 - (1 - d)}} = \frac{L(|h_\alpha|)}{\frac{1}{d}} = d\,L(|h_\alpha|). \tag{8}$$
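
As a quick numerical illustration of the age distribution in Section 5.2, the sketch below assumes the survival function $L(j) = (1 - d)^{j-1}$, which matches the value $0.8^{j-1}$ used for d = 0.2 in Example 6. The function names and the truncation at 500 rounds are choices made for this example only.

```python
def survival(j, d):
    """L(j): chance that an agent lives long enough to experience j rounds."""
    return (1 - d) ** (j - 1)


def age_probability(j, d):
    """Probability that a randomly selected agent is exactly j rounds old."""
    return d * survival(j, d)


d = 0.2
# The age probabilities should form a distribution (sum to 1 up to truncation).
total = sum(age_probability(j, d) for j in range(1, 500))
print(round(total, 6))
```

Summing over ages confirms that multiplying L by d normalizes it into a probability distribution, which is why d can later be factored out of the EPRU computation.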
5.3  Utility  Functions

A utility function $U((a, (m, v)))$ defines the utility gleaned on action-percept pair $(a, (m, v))$:

$$U((a, (m, v))) = \begin{cases} v, & \text{if } \exists i \text{ such that } a = X_i, \\ 0, & \text{otherwise.} \end{cases} \tag{9}$$

Notice that $U(\cdot)$ is only non-zero on exploitation actions.

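The per-round utility (PRU) of a history
$h_\alpha = (a_1, (m_1, v_1)) \circ \cdots \circ (a_{|h_\alpha|}, (m_{|h_\alpha|}, v_{|h_\alpha|}))$ is defined to be the
sum of the utility acquired in that history divided by the history's length:

$$\mathrm{PRU}(h_\alpha) = \frac{1}{|h_\alpha|} \sum_{i=1}^{|h_\alpha|} U((a_i, (m_i, v_i))). \tag{10}$$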
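
The utility and per-round utility definitions translate directly into code. The sketch below is illustrative only: encoding actions as the strings "Inv", "Obs", and "X1", "X2", ... and histories as lists of (action, (action_name, value)) pairs is an assumption made for the example.

```python
def utility(pair):
    """U((a, (m, v))): v for an exploitation action, 0 otherwise."""
    action, (_, value) = pair
    return value if action.startswith("X") else 0


def pru(history):
    """Per-round utility: total utility divided by the history's length."""
    return sum(utility(pair) for pair in history) / len(history)


# Exploiting a value-10 action on round 2 ...
early = [("Inv", ("X1", 10)), ("X1", ("X1", 10))]
# ... versus nine rounds of learning, then exploiting a value-50 action on round 10.
late = [("Obs", ("X2", 50))] * 9 + [("X2", ("X2", 50))]
print(pru(early), pru(late))
```

Both histories have the same per-round utility, which is the comparison made in Section 3.2 between exploiting a mediocre action early and a better action late.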


5.4  Strategy  Representation 

A strategy s is defined as a function mapping each history $h_\alpha \in H$ to the agent's next action
$s(h_\alpha) \in \{\mathrm{Inv}, \mathrm{Obs}, X_1, \dots, X_\mu\}$. For instance, the strategy I1 from Example 1
is defined by the function:

$$s_{I1}(h_\alpha) = \begin{cases} \mathrm{Inv}, & \text{if } h_\alpha \text{ is empty}, \\ X_i, & \text{for } h_\alpha = (\mathrm{Inv}, (X_i, v)), \dots \end{cases}$$

In this paper we will deal with partially specified strategies. A partially specified strategy is a mixed strategy
(i.e., a probability distribution over a set of pure strategies) that is defined by a finite set Q of history-action
pairs ($Q \subset H \times \{\mathrm{Inv}, \mathrm{Obs}, X_1, \dots, X_\mu\}$), in which each $h_\alpha \in H$ appears at
most once. Given any history $h_\alpha$, if there is an action m such that $(h_\alpha, m) \in Q$, then $s_Q$ chooses
the action m. Otherwise, $s_Q$ chooses an action arbitrarily from all actions that are legal in $h_\alpha$. Partially
specified strategies have the advantage of being guaranteed to be finitely representable.
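
One way to picture a partially specified strategy is as a finite lookup table with a random fallback. The following sketch is an assumption-laden illustration: `make_partial_strategy`, the tuple encoding of histories, and the toy `legal_actions` rule (which ignores the limit on innovation) are all hypothetical and not taken from the paper.

```python
import random


def make_partial_strategy(Q, legal_actions, rng=random):
    """Build a strategy from a finite dict Q mapping histories (tuples) to actions."""
    def strategy(history):
        key = tuple(history)
        if key in Q:
            return Q[key]  # history is specified: follow Q
        # history is unspecified: choose arbitrarily among legal actions
        return rng.choice(legal_actions(history))
    return strategy


def legal_actions(history):
    """Toy legality rule: any action named in an earlier percept may be exploited."""
    known = sorted({m for _, (m, _) in history})
    return ["Inv", "Obs"] + known


Q = {(): "Inv"}  # specify only the first move
s_q = make_partial_strategy(Q, legal_actions)
print(s_q(()), s_q((("Inv", ("X1", 10)),)))
```

The table Q stays finite no matter how many histories exist, which is the property that makes such strategies finitely representable.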


5.5  Evaluating  Strategies 

At each round, an agent with history $h_\alpha$ has reproductive fitness $\mathrm{PRU}(h_\alpha)$, and agents are
selected to reproduce with probability proportional to their reproductive fitness (i.e., using the replicator
equation [16]).
Since  a  strategy’s  score  is  a  function  of  its  average  population  over  the  course  of  the  game,  we  want  some 
metric  that  allows  us  to  compare  the  expected  reproductive  fitness  of  two  strategies.  This  will  allow  us  to 
predict  which  strategy  is  more  likely  to  win. 

At first glance, it may appear that the way to predict which strategy will have higher expected reproductive
fitness is to compare their expected utilities. However, prior work has shown that this is not the case: in
Cultaptation,  a  strategy’s  expected  reproductive  fitness  is  not  necessarily  proportional  to  its  expected  utility 
[32].  We  now  present  a  simple  example  that  illustrates  this  phenomenon. 

Example  5  (reproductive  fitness  not  proportional  to  expected  utility)  Consider  an  infinite 
Cultaptation  game  with  no  probability  of  change,  no  observation  actions,  probability  of  death 
d  =  0.05,  two  exploitation  actions  valued  at  65  and  100,  and  an  innovate  action  that  will  return 
either  exploitation  action  with  uniform  probability.  This  means  that  an  agent  needs  to  perform 
at  most  two  innovate  actions  to  have  knowledge  of  the  action  with  value  100,  since  innovating 
does  not  return  an  action  the  agent  already  knows. 

We will compare two strategies: $s_{II}$ and $s_{IEI}$. Both strategies will perform an innovate as their
first action. If the action they learn has value 100, both strategies will exploit that action until
the agent dies. If the action learned has value 65, $s_{II}$ will perform a second innovate on its
next turn, learning the action with value 100, and will exploit that action until its agent dies.
Meanwhile, $s_{IEI}$ will exploit the action with value 65 once, before performing an innovate on
its third turn to learn the action with value 100. It then exploits this action until its agent dies.

Table 3: The expected utility and expected reproductive fitness of $s_{II}$ and $s_{IEI}$ from Example 5.
$s_{IEI}$ has a higher expected reproductive fitness, and therefore will be likely to win a game against
$s_{II}$, even though $s_{II}$ has a higher expected utility.

Strategy     Expected Utility     Expected Reproductive Fitness
$s_{II}$     90.25                65.074
$s_{IEI}$    88.825               65.185


Since  the  two  strategies  are  identical  when  they  learn  the  action  with  value  100  on  their  first 
action,  and  since  this  case  is  equally  likely  to  be  encountered  by  both  strategies,  we  can  ignore 
it  for  the  purposes  of  comparing  them.  For  the  rest  of  this  analysis  we  will  assume  the  first 
innovate  returns  the  action  with  value  65.  In  this  case,  we  can  calculate  the  expected  utility 
for  both  strategies  using  geometric  series,  and  we  can  calculate  their  expected  reproductive 
fitnesses using methods described in Section 6. Table 3 presents these values. While $s_{II}$ has a
higher expected utility, since it exploits the action with value 100 more often, $s_{IEI}$ has a higher
expected reproductive fitness, since it does not wait as long to begin exploiting. Therefore, $s_{IEI}$
will be the likely winner in a contest between these two strategies.
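
The expected-utility column of Table 3 can be checked with a short geometric-series computation. The sketch below rests on an assumption: it weights the payoff of round j by $d(1-d)^{j-1}$, a normalization chosen here because it reproduces the tabulated values 90.25 and 88.825. The function name and the truncation horizon are illustrative. The reproductive-fitness column depends on the methods of Section 6 and is not recomputed here.

```python
def expected_utility(opening_payoffs, tail_payoff, d, horizon=2000):
    """d * sum over rounds j of (1 - d)**(j - 1) * (payoff collected on round j).

    opening_payoffs lists the payoffs of the first rounds; every later
    round pays tail_payoff. The sum is truncated at `horizon` rounds.
    """
    total = 0.0
    for j in range(1, horizon + 1):
        u = opening_payoffs[j - 1] if j <= len(opening_payoffs) else tail_payoff
        total += (1 - d) ** (j - 1) * u
    return d * total


d = 0.05
# Both cases assume the first innovate returned the value-65 action.
eu_two_innovates = expected_utility([0, 0], 100, d)       # Inv, Inv, then exploit 100
eu_exploit_first = expected_utility([0, 65, 0], 100, d)   # Inv, exploit 65, Inv, then exploit 100
print(round(eu_two_innovates, 3), round(eu_exploit_first, 3))
```

Under these assumptions the strategy that innovates twice before exploiting has the higher expected utility, in agreement with the first column of the table.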

Since  we  cannot  always  use  a  strategy’s  expected  utility  to  determine  whether  it  is  expected  to  win, 
we  will  instead  compute  a  strategy’s  expected  reproductive  fitness  directly,  by  computing  its  Expected  Per- 
Round  Utility. 

Definition. The Expected Per-Round Utility for a strategy $s_\alpha$, $\mathrm{EPRU}(s_\alpha \mid G, S)$, is the
expected value of $\mathrm{PRU}(h_\alpha)$ over all possible histories $h_\alpha$ for a randomly-selected agent
$\alpha$ using strategy $s_\alpha \in S$ in an infinite Cultaptation game instance (G, S). □

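To calculate $\mathrm{EPRU}(s_\alpha \mid G, S)$, we look at each possible history $h_\alpha$ and multiply
$\mathrm{PRU}(h_\alpha)$ by the probability that a randomly-chosen agent using $s_\alpha$ has history $h_\alpha$.
This probability is equal to the probability that a randomly-chosen agent is $|h_\alpha|$ rounds old (Equation 8)
times the probability of reaching history $h_\alpha$ in $|h_\alpha|$ steps using strategy $s_\alpha$ (Equation 6).
Hence, the EPRU of a strategy is:

$$\mathrm{EPRU}(s_\alpha \mid G, S) = \sum_{h_\alpha \in H} \underbrace{d\,L(|h_\alpha|)}_{\text{portion of agents } |h_\alpha| \text{ rounds old}} \times \underbrace{P(h_\alpha \mid s_\alpha, S)}_{\text{chance of reaching } h_\alpha \text{ using } s_\alpha} \times \underbrace{\mathrm{PRU}(h_\alpha)}_{\text{per-round utility}}.$$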

Note  that  for  a  given  environment,  the  probability  of  death  d  is  a  constant.  Hence,  in  our  analysis  we 
will  frequently  factor  it  out. 

Example 6 Recall the innovate-once strategy, which innovates once to learn an action and then
exploits that action until it dies. Suppose this strategy exists in an environment with a probability
of death of 0.2 and only one possible exploit action with non-changing value 10. All agents
using this strategy will therefore learn the only action on their first round, and then exploit an
action with value 10 on all subsequent rounds. Hence, there is only one possible history for a
j-round-old agent using this strategy, and its per-round utility is $10 \cdot (j - 1)/j$. The probability that
a randomly-selected agent will be j rounds old will be $0.2 \cdot L(j) = 0.2 \cdot 0.8^{j-1}$. Thus the expected
per-round utility achieved by this strategy in this environment is
$\sum_{j=1}^{\infty} 0.2 \cdot 0.8^{j-1} \cdot 10 \cdot (j - 1)/j$.
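
The sum in Example 6 is easy to evaluate numerically. The sketch below truncates the series at a large horizon and compares it with a closed form obtained from the standard identity $\sum_{j \ge 1} x^{j-1}/j = -\ln(1-x)/x$; the function name and the horizon are assumptions made for illustration.

```python
import math


def epru_innovate_once(d, value, horizon=2000):
    """Sum over ages j of d * (1 - d)**(j - 1) * value * (j - 1) / j."""
    return sum(d * (1 - d) ** (j - 1) * value * (j - 1) / j
               for j in range(1, horizon + 1))


approx = epru_innovate_once(d=0.2, value=10)
# Closed form: value * (1 - d * (-ln d) / (1 - d)), using x = 1 - d in the identity.
exact = 10 * (1 - 0.2 * (-math.log(0.2)) / 0.8)
print(round(approx, 4), round(exact, 4))
```

Both computations give an expected per-round utility of roughly 5.98, noticeably below the action's value of 10 because young agents, which have spent a large share of their lives innovating, make up much of the population.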
6  Analysis  of  EPRU


In  this  section  we  examine  methods  for  computing  the  expected  per-round  utility  of  a  strategy.  First  we 
present a method for computing an approximation to the EPRU for a given strategy, then we present a proof
that  a  strategy  maximizing  EPRU  will  also  maximize  its  average  population  in  an  infinite  Cultaptation  game 
instance. 


6.1  Computation  of  EPRU 

We will now define a formula that can be used to compute EPRU exactly for a given strategy s. The definition
of EPRU given in Section 5.5 used a "backward" view: for every possible history $h_\alpha$, it looked back
through $h_\alpha$ to determine $\mathrm{PRU}(h_\alpha)$. Notice, however, that $h_\alpha$ must have some preceding
history $h'_\alpha$, where $h_\alpha = h'_\alpha \circ t$ for some action-percept pair t. This definition of EPRU must
examine $h'_\alpha$ and $h_\alpha$ independently, even though their only difference is the addition of t.

For this reason, it will make more sense computationally to use a "forward" view of EPRU: we will
construct a recursive function on s and $h_\alpha$ which, for each possible $h_\alpha \circ t$:

•  calculates the per-round utility gained from t, both for history $h_\alpha \circ t$ and for all histories that can
be reached from $h_\alpha \circ t$, and then

•  recurses on s and $h_\alpha \circ t$.


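For the calculation in the first bullet, we will use the formula $EV_{exp}(r, v)$, which computes the expected
amount of per-round utility we gain (on the current round and on future rounds) by exploiting a value v on
round r:

$$EV_{exp}(r, v) = \sum_{j=r}^{\infty} \frac{L(j)\,v}{j} \tag{11}$$

Using known properties of infinite series, $EV_{exp}$ can also be expressed as (see footnote 8)

$$EV_{exp}(r, v) = v \left( \sum_{j=1}^{\infty} \frac{1}{j} (1 - d)^{j-1} - \sum_{i=1}^{r-1} \frac{1}{i} (1 - d)^{i-1} \right) = v \left( \frac{\ln d}{d - 1} - \sum_{i=1}^{r-1} \frac{1}{i} (1 - d)^{i-1} \right) \tag{12}$$

and is therefore computable.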
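
The equivalence between the infinite-sum form of $EV_{exp}$ and a finite closed form can be checked numerically. The sketch below assumes $L(j) = (1 - d)^{j-1}$ and uses the identity $\sum_{i \ge 1} (1-d)^{i-1}/i = \ln(d)/(d-1)$; the function names and the truncation horizon are illustrative choices.

```python
import math


def ev_exp_series(r, v, d, horizon=5000):
    """Truncated infinite sum of L(j) * v / j for j >= r."""
    return sum((1 - d) ** (j - 1) * v / j for j in range(r, horizon + 1))


def ev_exp_closed(r, v, d):
    """Closed form: v * (ln(d) / (d - 1) - sum_{i < r} (1 - d)**(i - 1) / i)."""
    head = sum((1 - d) ** (i - 1) / i for i in range(1, r))
    return v * (math.log(d) / (d - 1) - head)


print(ev_exp_series(3, 10, 0.2), ev_exp_closed(3, 10, 0.2))
```

The closed form needs only r - 1 terms, so it avoids choosing a truncation point for the tail of the series.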

We can now express the expected per-round utility of a strategy s recursively in terms of the average
per-round payoff of an agent.

$$\mathrm{EPRU}_{alt}(s, h_\alpha \mid G, S) = \sum_{t \in T} P(h_\alpha \circ t \mid h_\alpha, s(h_\alpha), S) \cdot \big( EV_{exp}(|h_\alpha \circ t|, U(t)) + \mathrm{EPRU}_{alt}(s, h_\alpha \circ t \mid G, S) \big) \tag{13}$$

where T is the set of all possible action-percept pairs, and $h_\alpha \circ t$ represents a possible history on the next
round. Note that the size of T is finite. A proof that
$\mathrm{EPRU}(s \mid G, S)/d = \mathrm{EPRU}_{alt}(s, \langle\rangle \mid G, S)$ is included in Appendix A.

sThe  simplification  2Si  y(l  -  tfU1  =  is  due  to  [7], 

Unfortunately, computing $\mathrm{EPRU}_{\mathrm{alt}}$ is not possible since it suffers from infinite recursion. To handle this,
we introduce a depth-limited computation of $\mathrm{EPRU}_{\mathrm{alt}}$, which only computes the portion of the total EPRU
contributed by the first k rounds:

\[
\mathrm{EPRU}_{\mathrm{alt}}^{k}(s, h_\alpha \mid G, S) =
\begin{cases}
0 & \text{if } k = 0 \\
\sum_{t \in T} P(h_\alpha \circ t \mid h_\alpha, s(h_\alpha), S) \bigl( \mathrm{EV}_{\exp}(|h_\alpha \circ t|, U(t)) + \mathrm{EPRU}_{\mathrm{alt}}^{k-1}(s, h_\alpha \circ t \mid G, S) \bigr) & \text{otherwise}
\end{cases} \tag{14}
\]

We prove in Section 7.3 that if the search depth k is deep enough, $\mathrm{EPRU}_{\mathrm{alt}}^{k}(s, h_\alpha \mid G, S)$ will always be within
ε of $\mathrm{EPRU}_{\mathrm{alt}}(s, h_\alpha \mid G, S)$.
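The depth-limited recursion can be sketched for a fixed strategy as follows. This is only an illustration under stated assumptions: histories are tuples of action-percept pairs, `strategy`, `outcomes` and `utility` are hypothetical callables standing in for s, the transition probability P, and U(t), and the survival probability is taken to be L(j) = (1 - d)^(j-1).

```python
import math


def ev_exp(r, v, d):
    """Closed form of EVexp(r, v), assuming L(j) = (1 - d)^(j-1)."""
    head = sum((1.0 - d) ** (i - 1) / i for i in range(1, r))
    return v * (math.log(d) / (d - 1.0) - head)


def epru_alt_k(strategy, history, k, outcomes, utility, d):
    """Depth-limited evaluation of a fixed strategy from `history`.

    strategy(history)         -> action chosen at this history (hypothetical)
    outcomes(history, action) -> iterable of (pair t, probability) (hypothetical)
    utility(t)                -> utility U(t) of the action-percept pair t
    """
    if k == 0:
        return 0.0
    action = strategy(history)
    total = 0.0
    for t, p in outcomes(history, action):
        if p == 0.0:
            continue  # unreachable successor histories contribute nothing
        nxt = history + (t,)
        total += p * (ev_exp(len(nxt), utility(t), d)
                      + epru_alt_k(strategy, nxt, k - 1, outcomes, utility, d))
    return total
```

Because each level of the recursion adds one round's contribution, the returned value for depth k covers exactly rounds 1 through k when called on the empty history `()`.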


6.2  EPRU  Corresponds  to  Reproductive  Success 

This  section  provides  a  proof  that  if  a  strategy  has  the  highest  EPRU  for  the  given  environment,  it  will  also 
have  the  optimal  expected  probability  of  reproducing.  This  proof  applies  only  to  pairwise  games,  but  the 
same  techniques  can  apply  to  arbitrary  (finite)  numbers  of  strategies. 

Assume we have an infinite Cultaptation game instance (G, S), as defined in Section 5, made up of agents
using strategies s and s' (i.e. S = {s, s'}). Recall from Section 5 that the score for strategy s is

\[
\mathrm{score}(s) = \lim_{r \to \infty} \frac{\sum_{i=0}^{r} p(s, i)}{r}
\]

where p(s, i) is the number of agents using strategy s on round i. Our objective for this section will be to
show that EPRU(s | G, S) > EPRU(s' | G, S) if and only if score(s) > score(s').

We  begin  by  defining  a  reset  event,  which  will  help  us  illustrate  some  interesting  properties  of  infinite 
Cultaptation. 


Definition. Let n and n' be the number of agents using s and s', respectively, on the first round of the
game instance, and let N = n + n'. A reset event occurs when all the agents in the environment die on two
consecutive rounds, and on the second round they are replaced (via mutation) by n agents using s and n'
agents using s'. The probability of a reset event occurring is $\beta = d^{N} d^{N} m^{N} \binom{N}{n} 0.5^{N}$. □


In other words, after a reset event occurs the conditions are identical to those that were present on the
first round; the game instance has essentially started over. Note that β is the same on every round, and it is
always greater than 0.
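To get a feel for how rare reset events are, β can be evaluated directly. The sketch below assumes the reading β = d^N · d^N · m^N · C(N, n) · 0.5^N of the formula in the definition; the function name `reset_probability` is illustrative.

```python
from math import comb


def reset_probability(N: int, n: int, d: float, m: float) -> float:
    """beta = d^N * d^N * m^N * C(N, n) * 0.5^N (assumed reading of the definition)."""
    return d ** (2 * N) * m ** N * comb(N, n) * 0.5 ** N
```

Even for small populations β is tiny, but it is strictly positive whenever d > 0 and m > 0, which is all the argument requires: over infinitely many rounds an event with fixed positive probability recurs infinitely often.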

Since the game instance continues for an infinite number of rounds, there will be an infinite number of
reset events. Thus, if we were to run other game instances with S = {s, s'}, both strategies would have the
same score each time. Therefore, we also know that we can define each strategy's score as a function of its
expected population at each round, rather than its population for a single game instance. This gives us

\[
\lim_{r \to \infty} \frac{\sum_{i=0}^{r} p(s, i)}{r} = \lim_{r \to \infty} \frac{\sum_{i=0}^{r} \mathrm{EP}(s, i)}{r} \tag{15}
\]

where EP(s, r) is the expected population of agents using strategy s on round r.

We will also define EAU(s, r) to be the expected agent utility of strategy s on round r; that is, EAU(s, r)
is the expected PRU of a randomly-chosen agent using strategy s on round r. We can now define EP(s, r)
recursively for each strategy using the mechanics of Cultaptation, as follows. We will let EP(s, 0) = n and
EP(s', 0) = n'. Then, for r ≥ 0,


\[
\mathrm{EP}(t, r+1) =
\underbrace{(1-d)\,\mathrm{EP}(t, r)}_{\text{Survived from previous round}}
+ \underbrace{Nd(1-m)\,\frac{\mathrm{EP}(t, r)\,\mathrm{EAU}(t, r)}{\mathrm{TU}(r)}}_{\text{New agents from selection}}
+ \underbrace{Nd\,\frac{m}{2}}_{\text{New agents from mutation}}
\]

where t ∈ {s, s'} and TU(r) = EP(s, r) EAU(s, r) + EP(s', r) EAU(s', r) is the expected total utility on round r.
Recall from Section 5 that N is the total number of agents in the environment, d is the probability of death,
and m is the probability of mutation.
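The recursion is easy to iterate numerically. The sketch below makes two simplifying assumptions that are not part of the original text: the mutation term is taken to be Nd·m/2 (mutants split evenly between the two strategies, as in the limiting equation later in this section), and EAU(t, r) is held constant at its limiting value rather than varying with r. The dictionary keys `"s"` and `"s_prime"` are illustrative.

```python
def step_expected_population(ep, eau, N, d, m):
    """One application of the EP recursion for two strategies.

    ep, eau: dicts keyed by strategy name holding EP(t, r) and EAU(t, r).
    """
    total_utility = sum(ep[t] * eau[t] for t in ep)  # TU(r)
    return {
        t: (1.0 - d) * ep[t]                                   # survivors
        + N * d * (1.0 - m) * ep[t] * eau[t] / total_utility   # selection
        + N * d * m / 2.0                                      # mutation (assumed form)
        for t in ep
    }


def iterate_expected_population(ep0, eau, N, d, m, rounds):
    """Apply the recursion `rounds` times starting from ep0."""
    ep = dict(ep0)
    for _ in range(rounds):
        ep = step_expected_population(ep, eau, N, d, m)
    return ep
```

With equal utilities the populations drift to N/2 each regardless of the starting split, and the total population stays at N on every round, since the three terms sum to (1 - d)N + Nd(1 - m) + Ndm.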

We now consider the behavior of EAU(s, r) as r increases.


Lemma 1 For any strategy s, $\lim_{r \to \infty} \mathrm{EAU}(s, r) = y$ for some finite y.


Proof. Let u(s, r) be the expected utility of a single agent using strategy s when r rounds have passed since
the first round or the last reset event. For all r, we know that $0 \le u(s, r) \le V_{\max}/d$, since agents cannot earn
negative utility, and no strategy can do better than exploiting the best possible action for its entire expected
lifespan of 1/d rounds. We can rewrite EAU(s, r) in terms of u(s, r) as follows.

\[
\mathrm{EAU}(s, r) = \beta \left( \sum_{i=0}^{r-1} (1-\beta)^{i}\, u(s, i) \right) + (1-\beta)^{r}\, u(s, r)
\]

Taking the limit of this form gives us

\[
\lim_{r \to \infty} \mathrm{EAU}(s, r)
= \lim_{r \to \infty} \beta \left( \sum_{i=0}^{r-1} (1-\beta)^{i}\, u(s, i) \right) + \lim_{r \to \infty} (1-\beta)^{r}\, u(s, r)
= \lim_{r \to \infty} \beta \left( \sum_{i=0}^{r-1} (1-\beta)^{i}\, u(s, i) \right)
\]

Since u(s, i) is bounded and $\sum_{i=0}^{\infty} (1-\beta)^{i}$ is a geometric series, $\lim_{r \to \infty} \beta \bigl( \sum_{i=0}^{r-1} (1-\beta)^{i} u(s, i) \bigr)$ converges
absolutely by the comparison test. Hence, $\lim_{r \to \infty} \mathrm{EAU}(s, r) = y$ for some finite y. □


Lemma 2 For any strategy $s_\alpha$ and set of available strategies S,

\[
\lim_{r \to \infty} \mathrm{EAU}(s_\alpha, r) = \mathrm{EPRU}(s_\alpha \mid G, S)
\]


Proof. The expected agent utility EAU($s_\alpha$, r) is defined as the expected PRU of an agent using strategy $s_\alpha$
on round r. As r approaches infinity, the probability that a randomly-selected agent will be i rounds old
approaches $L(i) / \sum_{j=0}^{\infty} L(j) = dL(i)$. The probability of reaching a history $h_\alpha$ is defined in Section 5.5 as
$P(h_\alpha \mid s_\alpha, S)$, and as r increases the set of histories a randomly-selected agent may have approaches H, the set
of all histories. Thus,

\[
\lim_{r \to \infty} \mathrm{EAU}(s_\alpha, r) = \sum_{h_\alpha \in H} dL(|h_\alpha|) \times P(h_\alpha \mid s_\alpha, S) \times \mathrm{PRU}(h_\alpha),
\]

which is the definition of EPRU($s_\alpha$ | G, S). □


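EP(s, r) and EP(s', r) are both functions of EAU(s, r) and EAU(s', r), which converge to EPRU(s | G, S)
and EPRU(s' | G, S) respectively. Therefore, EP(s, r) and EP(s', r) must also converge as r approaches
infinity. We will let $\mathrm{EP}(t) = \lim_{r \to \infty} \mathrm{EP}(t, r)$ for t ∈ {s, s'}. We can find the value of EP(s) as follows:

\[
\mathrm{EP}(s) = (1-d)\,\mathrm{EP}(s) + Nd(1-m)\,\frac{\mathrm{EP}(s)\,\mathrm{EPRU}(s \mid G, S)}{\mathrm{EP}(s)\,\mathrm{EPRU}(s \mid G, S) + \mathrm{EP}(s')\,\mathrm{EPRU}(s' \mid G, S)} + Nd\,\frac{m}{2}
\]

After substituting EP(s') = N - EP(s) and rearranging terms, we have

\begin{align*}
0 = {} & \bigl( \mathrm{EPRU}(s \mid G, S) - \mathrm{EPRU}(s' \mid G, S) \bigr)\, \mathrm{EP}(s)^{2} \\
& + N \Bigl( \bigl(1 + \tfrac{m}{2}\bigr)\, \mathrm{EPRU}(s' \mid G, S) - \bigl(1 - \tfrac{m}{2}\bigr)\, \mathrm{EPRU}(s \mid G, S) \Bigr)\, \mathrm{EP}(s) - N^{2}\, \tfrac{m}{2}\, \mathrm{EPRU}(s' \mid G, S).
\end{align*}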

Assume EPRU(s | G, S) > 0 and EPRU(s' | G, S) > 0 and let x = EPRU(s | G, S) / EPRU(s' | G, S). Then
we can rewrite the above as

\[
0 = (x - 1)\, \mathrm{EP}(s)^{2} + N \Bigl( 1 + \tfrac{m}{2} - x \bigl( 1 - \tfrac{m}{2} \bigr) \Bigr)\, \mathrm{EP}(s) - N^{2}\, \tfrac{m}{2}. \tag{16}
\]

This equation, when subject to the constraint 0 < EP(s) < N, allows us to express EP(s) as a strictly
increasing function of x. It also has the property that when x = 1, EP(s) = EP(s') = N/2.
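The root of this quadratic that lies between 0 and N can be written in one expression. The sketch below assumes the coefficients (x - 1), N(1 + m/2 - x(1 - m/2)) and -N²m/2 and a mutation rate m > 0; it uses the rationalized form of the quadratic formula so that the case x = 1, where the leading coefficient vanishes, needs no special handling. The function name `limiting_population` is illustrative.

```python
import math


def limiting_population(x: float, N: float, m: float) -> float:
    """Root of (x-1)E^2 + N(1 + m/2 - x(1 - m/2))E - N^2 m/2 = 0 with 0 < E < N."""
    if x <= 0.0 or not 0.0 < m <= 1.0:
        raise ValueError("requires x > 0 and 0 < m <= 1")
    b = N * (1.0 + m / 2.0 - x * (1.0 - m / 2.0))
    # Discriminant b^2 - 4ac with a = x - 1 and c = -N^2 m / 2.
    disc = b * b + 2.0 * (x - 1.0) * N * N * m
    # (-b + sqrt(disc)) / (2a) rewritten as -2c / (b + sqrt(disc)).
    return N * N * m / (b + math.sqrt(disc))
```

Two sanity checks follow from the symmetry of the model: the value at x = 1 is N/2, and the values at x and 1/x add up to N, because swapping the roles of s and s' inverts the utility ratio.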

Lemma 3 EP(s) > EP(s') if and only if EPRU(s | G, S) > EPRU(s' | G, S).


Proof. Assume EPRU(s | G, S) > EPRU(s' | G, S). Then x > 1 in Equation 16, and therefore EP(s) > N/2,
so EP(s) > EP(s'). Conversely, assume EP(s) > EP(s'). Using Equation 16, we know that x > 1 and therefore that
EPRU(s | G, S) > EPRU(s' | G, S). Hence, EP(s) > EP(s') if and only if EPRU(s | G, S) > EPRU(s' | G, S).

□

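We can now calculate the ratio between each strategy's score using EP(s) and EP(s').


Lemma 4 For all s and s',

\[
\lim_{r \to \infty} \frac{\sum_{i=0}^{r} \mathrm{EP}(s, i)}{r} > \lim_{r \to \infty} \frac{\sum_{i=0}^{r} \mathrm{EP}(s', i)}{r}
\]

if and only if EP(s) > EP(s').


Proof. We know that

\[
\frac{\lim_{r \to \infty} \frac{1}{r} \sum_{i=0}^{r} \mathrm{EP}(s, i)}{\lim_{r \to \infty} \frac{1}{r} \sum_{i=0}^{r} \mathrm{EP}(s', i)}
= \lim_{r \to \infty} \frac{\sum_{i=0}^{r} \mathrm{EP}(s, i)}{\sum_{i=0}^{r} \mathrm{EP}(s', i)}.
\]

Since the sequence $b_r = \sum_{i=0}^{r} \mathrm{EP}(s', i)$ is unbounded and strictly increasing, we can use the Stolz-Cesàro
theorem to obtain

\[
\lim_{r \to \infty} \frac{\sum_{i=0}^{r} \mathrm{EP}(s, i)}{\sum_{i=0}^{r} \mathrm{EP}(s', i)}
= \lim_{r \to \infty} \frac{\sum_{i=0}^{r} \mathrm{EP}(s, i) - \sum_{i=0}^{r-1} \mathrm{EP}(s, i)}{\sum_{i=0}^{r} \mathrm{EP}(s', i) - \sum_{i=0}^{r-1} \mathrm{EP}(s', i)}
= \frac{\lim_{r \to \infty} \mathrm{EP}(s, r)}{\lim_{r \to \infty} \mathrm{EP}(s', r)}
= \frac{\mathrm{EP}(s)}{\mathrm{EP}(s')}.
\]

□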

From  Equation  15  and  Lemmas  3  and  4,  we  immediately  get  the  following: 


Theorem 1 For all s and s',

\[
\lim_{r \to \infty} \frac{\sum_{i=0}^{r} p(s, i)}{r} > \lim_{r \to \infty} \frac{\sum_{i=0}^{r} p(s', i)}{r}
\]

if and only if EPRU(s | G, S) > EPRU(s' | G, S).


Therefore,  a  strategy’s  expected  reproductive  success  is  directly  proportional  to  its  EPRU. 

6.2.1 Irrelevance of the Initial Strategy Profile

From the fact that EPRU is independent of the initial strategy profile s, we also get the following corollary
which will help us understand some of our experimental results (see Section 9).

Corollary 1 The initial strategy profile s of an infinite Cultaptation game instance (defined in Section 5)
does not affect the score of any strategy in S.

If this seems counterintuitive, consider the following. At the beginning of this section we defined a reset
event, in which every agent dies on two consecutive rounds, and all are replaced via mutation so that the
population is identical to the initial strategy profile. For each reset event, there will be many similar events
in which every agent dies on two consecutive rounds and is replaced via mutation, but in some arrangement
different from the initial strategy profile. The probability of this happening is $d^{2N} m^{N}$, which is greater than 0.
In an infinite-length game, such an event will eventually occur with probability 1. After it occurs, the initial
strategy profile clearly has no bearing on how the rest of the game plays out, yet there are still an infinite
number of rounds left in the game. Since each strategy's score is its average population over the entire game
(see Section 5), the impact of the initial strategy profile on each strategy's total score is vanishingly small in
an infinite-length game.

6.2.2  Application  of  EPRU  to  other  Evolutionary  Games 

Many of the equations used in calculating EPRU involve concepts particular to Cultaptation, such as
innovation, observation, and changing action values. However, the general technique we use is to calculate the
expected  reproductive  fitness  of  an  agent  on  round  j,  multiply  this  quantity  by  the  expected  proportion  of 
agents  that  are  j  rounds  old,  and  sum  these  quantities  to  get  the  expected  fitness  of  an  entire  population.  This 
should  be  a  useful  metric  in  any  evolutionary  game  in  which  agents  live  for  more  than  one  generation  and 
reproduce  according  to  the  replicator  equation,  even  if  the  game  uses  some  measure  other  than  per-round 
utility  to  determine  reproductive  fitness.  The  proofs  in  this  section  rely  primarily  on  the  symmetry  between 
1)  the  probability  that  an  agent  will  be  alive  after  k  rounds  and  2)  the  expected  proportion  of  a  population 
of  agents  that  are  k  rounds  old  on  any  given  round.  Thus,  any  evolutionary  game  that  allows  agents  to  live 
more  than  one  generation  and  in  which  agents  die  with  the  same  probability  on  every  round  should  be  able 
to  use  a  metric  very  similar  to  EPRU  to  compare  strategies. 

7 Finding an ε-Best Response Strategy

In this section we explain what it means for a strategy to be a best response or near-best response in infinite
Cultaptation, and we provide an algorithm for calculating a near-best response to $S_{-\alpha}$, the available strategies
other than our own.

7.1  Problem  Specification 

Now  that  we  have  derived  EPRU  and  proved  that  a  strategy’s  EPRU  is  directly  proportional  to  its  score  in 
an  infinite  Cultaptation  game,  we  can  determine  how  each  strategy  in  a  given  set  of  available  strategies  S 
will  perform  by  evaluating  the  EPRU  of  each  strategy.  Therefore,  we  can  define  a  best-response  strategy  in 
terms  of  EPRU,  as  follows. 

Recall that in an infinite Cultaptation game (as defined in Section 5) there are ℓ players, each of whom
selects a strategy to put into the set of available strategies S. Let $S_{-\alpha}$ be the set of available strategies other
than our own, i.e. $S_{-\alpha} = \{s_1, \ldots, s_{\alpha-1}, s_{\alpha+1}, \ldots, s_\ell\}$. We will say that strategy $s_{\mathrm{opt}}$ is a best response to $S_{-\alpha}$ if
and only if for any other strategy s',

\[
\mathrm{EPRU}(s_{\mathrm{opt}} \mid G, S_{-\alpha} \cup s_{\mathrm{opt}}) \ge \mathrm{EPRU}(s' \mid G, S_{-\alpha} \cup s').
\]

Computing $s_{\mathrm{opt}}$ is not possible due to its prohibitively large size. However, we can compute an ε-best-response
strategy, i.e., a strategy s such that $\mathrm{EPRU}(s \mid G, S_{-\alpha} \cup s)$ is arbitrarily close to
$\mathrm{EPRU}(s_{\mathrm{opt}} \mid G, S_{-\alpha} \cup s_{\mathrm{opt}})$. This problem can be stated formally as follows: Given game parameters G, error bound ε > 0, and the
set $S_{-\alpha}$ of available strategies other than our own, find a strategy $s_\alpha$ such that $\mathrm{EPRU}(s_\alpha \mid G, S_{-\alpha} \cup s_\alpha)$ is within
ε of $\mathrm{EPRU}(s_{\mathrm{opt}} \mid G, S_{-\alpha} \cup s_{\mathrm{opt}})$.


7.2  Bounding  EPRU 


In  games  where  0  <  d  <  1,  an  agent  could  potentially  live  for  any  finite  number  of  rounds.  However, 
since  the  agent’s  probability  of  being  alive  on  round  r  decreases  exponentially  with  r,  the  expected  utility 
contributed  by  an  agent’s  actions  in  later  rounds  is  exponentially  lower  than  the  expected  utility  contributed 
by earlier rounds. We will use this fact in deriving a bound on $\mathrm{EPRU}_{\mathrm{alt}}(s, h_\alpha \mid G, S)$ for a given strategy and
a history $h_\alpha$ of length l.

Recall from Equations 11 and 12 that:

\[
\mathrm{EV}_{\exp}(r, v) = v \sum_{i=r}^{\infty} \frac{1}{i}(1-d)^{i-1} = v \left( \frac{\ln d}{d-1} - \sum_{i=1}^{r-1} \frac{1}{i}(1-d)^{i-1} \right) \tag{17}
\]

where EVexp(r, v) is the expected contribution to EPRU made by exploiting an action with value v on round r.

Since we know how much any given exploit contributes to the expected EPRU($s_\alpha$ | G, S) for a given
strategy $s_\alpha$, we can calculate G(l, v), the amount that exploiting the same action on all rounds after l contributes
to EPRU($s_\alpha$ | G, S), as follows:


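\[
G(l, v) = \sum_{j=l+1}^{\infty} v \sum_{n=j}^{\infty} \frac{1}{n}(1-d)^{n-1} = v \sum_{j=l+1}^{\infty} \sum_{n=j}^{\infty} \frac{1}{n}(1-d)^{n-1}
\]

Expanding the summations yields:

\begin{align*}
G(l, v) &= v \left( \frac{1}{l+1}(1-d)^{l} + \frac{2}{l+2}(1-d)^{l+1} + \cdots \right) \\
&= v \sum_{n=l+1}^{\infty} \frac{n-l}{n}(1-d)^{n-1} \\
&= v \left( \sum_{n=l+1}^{\infty} (1-d)^{n-1} - \sum_{n=l+1}^{\infty} \frac{l}{n}(1-d)^{n-1} \right) \\
&= v \left( \frac{(1-d)^{l}}{d} - \sum_{n=l+1}^{\infty} \frac{l}{n}(1-d)^{n-1} \right)
\end{align*}

Next, we pull l out of the summation and use (17) to obtain:

\[
G(l, v) = v \left( \frac{(1-d)^{l}}{d} - \underbrace{l \left( \frac{\ln d}{d-1} - \sum_{n=1}^{l} \frac{1}{n}(1-d)^{n-1} \right)}_{(a)} \right) \tag{18}
\]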

Note that for 0 < d < 1, G(l, v) is finite. G(l, v) provides a closed-form formula for the eventual
contribution of exploiting an action with value v at every round after the l-th round. Since the set V of
possible action values is finite (see Section 5), let $v_{\max} = \max(V)$ be the largest of these values. Then
$G(l, v_{\max})$ is an upper bound on the expected per-round utility achieved after round l (clearly no strategy can
do better than exploiting an action with maximal value on every round after round l). We use this fact to bound
the depth-limited expected per-round utility computation.
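The closed form for G(l, v) can be cross-checked against the series it was derived from. This is a minimal sketch, assuming the closed form v((1-d)^l/d - l(ln d/(d-1) - Σ_{n=1}^{l} (1-d)^(n-1)/n)) and the single-sum form v Σ_{n>l} ((n-l)/n)(1-d)^(n-1) obtained by expanding the double sum; the function names are illustrative.

```python
import math


def g_closed(l: int, v: float, d: float) -> float:
    """Closed form of G(l, v), valid for 0 < d < 1."""
    tail = math.log(d) / (d - 1.0) - sum((1.0 - d) ** (n - 1) / n for n in range(1, l + 1))
    return v * ((1.0 - d) ** l / d - l * tail)


def g_series(l: int, v: float, d: float, terms: int = 5_000) -> float:
    """Truncated series v * sum_{n > l} ((n - l) / n) * (1 - d)^(n-1)."""
    return v * sum((n - l) / n * (1.0 - d) ** (n - 1) for n in range(l + 1, l + 1 + terms))
```

Both forms should agree up to truncation and rounding error, and the value should decrease as l grows, since fewer rounds remain to contribute.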

Theorem 2 Let $v_{\max}$ be the highest possible action utility for game parameters G, and let $S_{-\alpha}$ be the set of
available strategies other than our own. Then for all l and all strategies $s_\alpha$,

\[
\mathrm{EPRU}_{\mathrm{alt}}(s_\alpha, \langle\rangle \mid G, S_{-\alpha} \cup s_\alpha) - \mathrm{EPRU}_{\mathrm{alt}}^{l}(s_\alpha, \langle\rangle \mid G, S_{-\alpha} \cup s_\alpha) \le G(l, v_{\max}).
\]

Proof. Since it is not possible for any strategy to gain more utility than $v_{\max}$ on any round, this follows from
the discussion above. □

Theorem 2 states that $G(l, v_{\max})$ is the highest possible contribution to $\mathrm{EPRU}_{\mathrm{alt}}(s_\alpha, \langle\rangle \mid G, S)$ made by any
strategy $s_\alpha$ after round l. Thus, if we are given an ε > 0 and we can find a value of k such that $G(k, v_{\max}) \le \varepsilon$,
then we know that no strategy can earn more than ε expected utility after round k. The next section will
show how to find such a k.

7.3  Determining  How  Far  to  Search 

In this section we show how to find a search depth k such that, for any given ε > 0, no strategy can earn
more than ε utility after round k. We first note a bound on G(l, v):

Lemma 5 $G(l, v) \le v(1-d)^{l}/d$.

Proof. The lemma follows from noting that part (a) of Equation 18 is greater than or equal to zero, since
$\frac{\ln d}{d-1} = \sum_{n=1}^{\infty} \frac{1}{n}(1-d)^{n-1}$ and l < ∞. Writing w for part (a), we have $G(l, v) = v\bigl(\frac{(1-d)^{l}}{d} - w\bigr) \le v\,\frac{(1-d)^{l}}{d}$, since w is always non-negative. □

Now if we can find a k such that

\[
\varepsilon = v_{\max}(1-d)^{k}/d,
\]

then we can be certain that $\varepsilon \ge G(k, v_{\max})$. Solving for k in the above equation yields

\[
k = \log_{(1-d)}\!\left( \frac{\varepsilon d}{v_{\max}} \right), \tag{19}
\]

which has a solution for 0 < d < 1 and $v_{\max} > 0$, both of which will always be true in Cultaptation. This
gives us the following theorem.
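In practice the search depth must be a whole number of rounds. The sketch below assumes Equation 19 has the form k = log_(1-d)(εd / v_max) and rounds the result up to the next integer, which keeps the bound v_max(1-d)^k/d ≤ ε valid; the rounding step and the function name `search_depth` are additions made for illustration.

```python
import math


def search_depth(eps: float, d: float, v_max: float) -> int:
    """Smallest integer k with v_max * (1 - d)^k / d <= eps."""
    if not 0.0 < d < 1.0 or eps <= 0.0 or v_max <= 0.0:
        raise ValueError("requires 0 < d < 1, eps > 0 and v_max > 0")
    k = math.log(eps * d / v_max) / math.log(1.0 - d)  # change of base for log_(1-d)
    return max(0, math.ceil(k))
```

For example, with ε = 0.1, d = 0.1 and v_max = 10 the real-valued solution is roughly 65.6, so a depth of 66 rounds suffices. The depth grows only logarithmically in 1/ε, but the cost of searching to that depth is exponential in k.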

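Theorem 3 Given ε > 0, set of available strategies $S_{-\alpha}$ other than our own, and game parameters G with
maximal utility $v_{\max}$, let $k = \log_{(1-d)}\bigl(\frac{\varepsilon d}{v_{\max}}\bigr)$. If $s_\alpha$ has the maximal value of $\mathrm{EPRU}_{\mathrm{alt}}^{k}(s_\alpha, \langle\rangle \mid G, S_{-\alpha} \cup s_\alpha)$, then
$s_\alpha$ is an ε-best response to $S_{-\alpha}$.

Proof. Let $s_{\mathrm{opt}}$ be the strategy with the maximal value of $\mathrm{EPRU}(s_{\mathrm{opt}} \mid G, S_{-\alpha} \cup s_{\mathrm{opt}})$. By Theorem 2, we
know that $s_{\mathrm{opt}}$ cannot earn more than ε expected utility on rounds after k. Since $s_\alpha$ earns the maximum
EPRU possible in the first k rounds, it follows that $|\mathrm{EPRU}(s_{\mathrm{opt}} \mid G, S_{-\alpha} \cup s_{\mathrm{opt}}) - \mathrm{EPRU}(s_\alpha \mid G, S_{-\alpha} \cup s_\alpha)| \le \varepsilon$.
Therefore, $s_\alpha$ is an ε-best response. □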

Algorithm 1 Produce strategy s that maximizes $\mathrm{EPRU}_{\mathrm{alt}}^{k}(s, h_\alpha \mid G, S_{-\alpha} \cup s)$, given initial history $h_\alpha$, set of
possible utility values V, and $S_{-\alpha}$, the set of available strategies other than our own.

Strat($h_\alpha$, k, V, $S_{-\alpha}$)

    if k = 0 then
        return {∅, 0}
    end if
    Let U_max = 0
    Let s_max = null
    for each action a ∈ {Inv, Obs, X_1, ..., X_μ} do
        Let U_temp = 0
        Let s_temp = ⟨h_α, a⟩
        for each action m ∈ {1, ..., μ} do
            for each value v ∈ V do
                Let t = (a, (m, v))
                Let p = P(h_α ∘ t | h_α, a, S_{-α})
                if p > 0 then
                    Let {S', U'} = Strat(h_α ∘ t, k - 1, V, S_{-α})
                    s_temp = s_temp ∪ S'
                    U_temp = U_temp + p (EVexp(|h_α ∘ t|, U(t)) + U')
                end if
            end for
        end for
        if U_temp > U_max then
            U_max = U_temp
            s_max = s_temp
        end if
    end for
    return {s_max, U_max}

7.4  Algorithm 

We will now present our algorithm for computing the strategy s with the maximal value of
$\mathrm{EPRU}_{\mathrm{alt}}^{k}(s, \langle\rangle \mid G, S_{-\alpha} \cup s)$, and show how it can be used to compute an ε-best response.

Algorithm 1 returns a 2-tuple with a partially specified strategy s and a scalar U. Strategy s maximizes
$\mathrm{EPRU}_{\mathrm{alt}}^{k}(s, h_\alpha \mid G, S_{-\alpha} \cup s)$, and U is the value of this expression.

The algorithm performs a depth-first search through the space of strategies that start from the input
history $h_\alpha$, stopping once it reaches a specified depth k. For each possible action a ∈ {Inv, Obs, X_1, ..., X_μ}
at $h_\alpha$, it computes the expected per-round utility gained from performing a, and the utility of the best strategy
for each possible history $h'_\alpha$ that could result from choosing a. It combines these quantities to get the total
expected utility for a, and selects the action with the best total expected utility, $a_{\max}$. It returns the strategy
created by combining the policy $\langle h_\alpha, a_{\max} \rangle$ with the strategies for each possible $h'_\alpha$, and the utility for this
strategy.

Seen another way, Strat($h_\alpha$, k, V, $S_{-\alpha}$) computes $\mathrm{EPRU}_{\mathrm{alt}}^{k}(s, h_\alpha \mid G, S_{-\alpha} \cup s)$ for all possible strategies s,
returning the strategy maximizing $\mathrm{EPRU}_{\mathrm{alt}}^{k}$ as well as the maximal value of $\mathrm{EPRU}_{\mathrm{alt}}^{k}$.
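The depth-first search described above can be sketched in a few lines. This is an illustrative toy rather than the authors' implementation: histories are tuples of action-percept pairs, the callables `actions`, `outcomes` and `utility` are hypothetical stand-ins for the available actions, the transition probability P and U(t), the survival probability is assumed to be L(j) = (1 - d)^(j-1), and the running maximum starts at negative infinity so that some action is always selected.

```python
import math


def ev_exp(r, v, d):
    """Closed form of EVexp(r, v), assuming L(j) = (1 - d)^(j-1)."""
    head = sum((1.0 - d) ** (i - 1) / i for i in range(1, r))
    return v * (math.log(d) / (d - 1.0) - head)


def strat(history, k, actions, outcomes, utility, d):
    """Return (policy, value) maximizing depth-k expected utility from `history`.

    policy maps each reachable history to the chosen action.
    """
    if k == 0:
        return {}, 0.0
    best_value, best_policy = float("-inf"), None
    for a in actions(history):
        value, policy = 0.0, {history: a}
        for t, p in outcomes(history, a):
            if p > 0.0:
                nxt = history + (t,)
                sub_policy, sub_value = strat(nxt, k - 1, actions, outcomes, utility, d)
                policy.update(sub_policy)  # merge the sub-strategies for successors
                value += p * (ev_exp(len(nxt), utility(t), d) + sub_value)
        if value > best_value:
            best_value, best_policy = value, policy
    return best_policy, best_value
```

In a toy game where innovating reveals a value of 4 or 8 with equal probability and exploiting yields the best known value, a depth-2 search innovates first and then exploits, as one would expect.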

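Proposition 1 Strat($h_\alpha$, k, V, $S_{-\alpha}$) returns (s, U) such that

\[
\mathrm{EPRU}_{\mathrm{alt}}^{k}(s, h_\alpha \mid G, S_{-\alpha} \cup s) = U = \max_{s'} \bigl( \mathrm{EPRU}_{\mathrm{alt}}^{k}(s', h_\alpha \mid G, S_{-\alpha} \cup s') \bigr).
\]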

A proof of this proposition is presented in Appendix A.

We  now  have  an  algorithm  capable  of  computing  the  strategy  with  maximal  expected  utility  over  the 
first  k  rounds.  Hence,  in  order  to  find  an  e-best  response  strategy  we  need  only  find  the  search  depth  k  such 
that  no  strategy  can  earn  more  than  e  expected  utility  after  round  k,  and  then  call  the  algorithm  with  that 
value  of  k. 

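Theorem 4 Given ε > 0, available strategies other than our own $S_{-\alpha}$, and a set of values V with maximum
value $v_{\max}$, let $k = \log_{(1-d)}\bigl(\frac{\varepsilon d}{v_{\max}}\bigr)$. Then Strat($\langle\rangle$, k, V, $S_{-\alpha}$) returns (s, U) such that s is an ε-best response to
$S_{-\alpha}$.

Proof. This follows from Theorem 3 and Proposition 1. □

We also have the following.

Corollary 2 Given available strategies other than our own $S_{-\alpha}$ and a set of values V, let $s_k$ be the strategy
returned by Strat($\langle\rangle$, k, V, $S_{-\alpha}$). Then $\lim_{k \to \infty} s_k$ is a best response to $S_{-\alpha}$.

Proof. Let $s_{\mathrm{opt}}$ be a best response to $S_{-\alpha}$. By Lemma 5 and Theorem 3,

\[
\mathrm{EPRU}(s_{\mathrm{opt}} \mid G, S_{-\alpha} \cup s_{\mathrm{opt}}) - \mathrm{EPRU}(s_k \mid G, S_{-\alpha} \cup s_k) \le v_{\max}(1-d)^{k}/d.
\]

Since $\lim_{k \to \infty} v_{\max}(1-d)^{k}/d = 0$, it follows that
$\lim_{k \to \infty} \bigl( \mathrm{EPRU}(s_{\mathrm{opt}} \mid G, S_{-\alpha} \cup s_{\mathrm{opt}}) - \mathrm{EPRU}(s_k \mid G, S_{-\alpha} \cup s_k) \bigr) = 0$.
Therefore, $\lim_{k \to \infty} s_k$ is a best response to $S_{-\alpha}$. □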

7.5  Implementation 

In  this  section  we  discuss  modifications  that  improve  the  running  time  of  Algorithm  1  without  any  loss 
in  accuracy.  Section  7.5.1  discusses  techniques  for  state  aggregation,  which  allow  us  to  cut  the  branching 
factor of the algorithm in half. Section 7.5.2 discusses the representation of $\pi_{\mathrm{Obs}}$, and Section 7.5.3 discusses
caching  and  pruning. 

7.5.1  State  Aggregation 

If  the  pseudocode  for  Algorithm  1  were  implemented  verbatim,  it  would  search  through  each  history  that 
can be reached from the starting state. However, there is a significant amount of extraneous information
in each history that is not needed for any of the algorithm's calculations. For example, the histories $h_\alpha$ =
⟨(Inv, (1, 10))⟩ and $h'_\alpha$ = ⟨(Inv, (2, 10))⟩ both describe a situation where α innovates once and obtains an
action  with  value  10.  The  only  difference  between  these  histories  is  the  identifier  assigned  to  the  action, 
which  does  not  impact  any  of  the  calculations — yet  the  pseudocode  must  still  search  through  each  of  these 
histories  separately.  We  can  eliminate  this  redundancy  by  using  repertoires,  rather  than  histories,  as  the 
states  for  the  algorithm  to  search  through.  A  repertoire  is  a  record  of  what  the  agent  knows  about  each  of 
the  actions  it  has  learned,  rather  than  a  record  of  everything  that  has  happened  to  it. 

Making this simple change allows Algorithm 1 to calculate the value of an observation action by
combining information it learns when exploring innovate and exploit actions, rather than recursing again. This
cuts  the  branching  factor  of  our  search  in  half.  The  analysis  and  details  involved  in  this  change,  as  well  as 
the  proof  that  the  version  of  the  algorithm  using  repertoires  returns  the  same  result  as  the  previous  version, 
are  included  in  Appendix  B. 

Running time analysis. When Algorithm 1 considers a history $h_\alpha$, it makes one recursive call for each
possible action-percept pair (a, (m, v)) that can be executed at $h_\alpha$. There are 2μ such pairs for each history: if
our agent knows how to exploit j actions, then it can exploit any of those j actions, innovate any of the μ - j actions it does not know, and
observe any of the μ actions. Each of these actions can also have any of v values. Hence, the number
of recursive calls made by the algorithm at each step is at most 2μv. Since the algorithm recurses to depth
k, the running time for Algorithm 1 is $O((2\mu v)^{k})$. With the state aggregation technique described above, we
do not need to perform additional recursions for observation actions. Hence, the number of recursive calls
made at each step is at most μv, and the total running time is $O((\mu v)^{k})$, which improves upon the original
running time by a factor of $2^{k}$.

7.5.2 Representing $\pi_{\mathrm{Obs}}$

For our formal proofs, we treated $\pi_{\mathrm{Obs}}$ as a black box that, when given our agent's history and round number,
could tell us the exact probabilities of observing each action on the current round. However, since there are
an exponential number of possible histories, storing $\pi_{\mathrm{Obs}}$ in this form would require an exponential amount
of space, which would severely limit the size of games for which we could compute strategies. We would
also need to run a prohibitively large number of simulations in Algorithm 2 (introduced in Section 8) to get
enough samples to generate a new $\pi_{\mathrm{Obs}}$ of this type.

Therefore, as an approximation, our implementation assumes that $\pi_{\mathrm{Obs}}$ has a similar structure to $\pi_{\mathrm{Inv}}$,
and remains constant throughout the agent's lifetime. That is, the $\pi_{\mathrm{Obs}}$ used in the experiments returns the
probability of an action valued v being observed. While this leads to some loss in accuracy, it is very easy to
store and compute. Further, we will see in our experimental results (particularly those dealing with iterative
computation in Section 9.2) that this form of $\pi_{\mathrm{Obs}}$ is still able to produce good strategies.

7.5.3  Caching  and  Pruning 

Since  our  implementation  uses  repertoires  rather  than  histories  to  represent  the  agent’s  set  of  known  actions, 
and since it is possible for two histories to produce the same repertoire, the algorithm will sometimes
encounter repertoires that it has already evaluated. So that the algorithm will not have to waste time evaluating
them again, the implementation includes a cache which stores the EPRU of every repertoire it has
evaluated. When the algorithm encounters a repertoire whose expected utility is needed, the implementation first
checks the cache to see if the EPRU of the repertoire has been previously computed, and uses the computed
value if it exists. Caching is widely used in tree-search procedures, and is analogous to the transposition
tables in chess-playing algorithms [31].

We also use another well-known method for avoiding unnecessary evaluation of states, namely
branch-and-bound pruning [19, 17]. Before one computes the expected per-round utility of a given action, one
checks  to  see  if  an  upper  bound  on  the  EPRU  of  that  action  would  be  sufficient  to  make  the  given  action’s 
utility  higher  than  the  best  previously  computed  action.  In  many  situations,  the  maximal  utility  that  can  be 
achieved  for  a  given  action  will  in  fact  be  less  than  the  utility  we  know  we  can  achieve  via  some  other  action, 
and  therefore  we  can  skip  the  evaluation  of  that  action  (i.e.  we  can  “prune”  it  from  the  search  tree). 

We  have  no  theoretical  guarantees  on  runtime  reduction  using  these  techniques,  but  we  will  see  in 
Section  9.1.1  that  the  combination  of  pruning  and  caching  allows  us  to  avoid  evaluating  significant  portions 
of  the  state  space  in  the  environments  we  tested. 
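The interaction of caching and pruning can be illustrated on a deliberately simplified search. Everything in this sketch is an assumption made for illustration and not the authors' implementation: a repertoire is modeled as a sorted tuple of known action values, the environment is static, observation is omitted, `pi_inv` is a hypothetical dictionary from innovated values to probabilities, and the optimistic bound for innovating is the utility of exploiting the maximal value on every remaining round.

```python
import math


def ev_exp(r, v, d):
    """Closed form of EVexp(r, v), assuming L(j) = (1 - d)^(j-1)."""
    head = sum((1.0 - d) ** (i - 1) / i for i in range(1, r))
    return v * (math.log(d) / (d - 1.0) - head)


def search(k, pi_inv, d, use_cache=True, use_pruning=True):
    """Return (best depth-k value from an empty repertoire, nodes expanded)."""
    v_max = max(pi_inv)
    cache = {}
    expanded = 0

    def best_value(rep, r, depth):
        nonlocal expanded
        if depth == 0:
            return 0.0
        key = (rep, r)
        if use_cache and key in cache:
            return cache[key]  # repertoire already evaluated at this round
        expanded += 1
        best = 0.0
        if rep:  # exploit the best known value (rep is sorted ascending)
            best = ev_exp(r, rep[-1], d) + best_value(rep, r + 1, depth - 1)
        # Innovating earns nothing now, so it cannot beat this optimistic bound.
        bound = sum(ev_exp(j, v_max, d) for j in range(r + 1, r + depth))
        if not (use_pruning and bound <= best):
            inv = sum(p * best_value(tuple(sorted(set(rep) | {v})), r + 1, depth - 1)
                      for v, p in pi_inv.items())
            best = max(best, inv)
        cache[key] = best
        return best

    return best_value((), 1, k), expanded
```

Running the search with and without the two optimizations should return the same value while expanding far fewer nodes when they are enabled, because different innovation orders lead to the same repertoire and because innovating is never worthwhile once the maximal value is known.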

8  Cultaptation  Strategy  Learning  Algorithm 

Until now we have assumed that Algorithm 1 has access to $\pi_{\mathrm{Obs}}$, the distribution of observable actions, when
it performs its calculations. While the algorithm finds the near-best-response strategy given a particular $\pi_{\mathrm{Obs}}$,
agents playing the real Cultaptation game are not given access to $\pi_{\mathrm{Obs}}$ beforehand, and even estimating what
$\pi_{\mathrm{Obs}}$ looks like can be very difficult while playing the game due to the limited amount of information each
agent receives in its lifetime. It is also unclear how exactly an agent's own actions will affect $\pi_{\mathrm{Obs}}$: by
exploiting a particular action, the agent is making that action observable to others who might then exploit it
in greater proportion than in the $\pi_{\mathrm{Obs}}$ used to compute the agent's strategy.

Algorithm 2 Produce an approximation of a strategy that is an ε-best response to itself.

CSLA($\pi_{\mathrm{Inv}}$, τ, k)

    Let π_Obs = π_Inv.
    s = ∅.
    repeat
        Let s_old = s.
        Let V = {π_Inv, π_Obs}.
        s = Strat(⟨⟩, k, V, S)
        Simulate a series of Cultaptation games in which s plays itself, and action utilities are initially drawn
            from π_Inv, recording all actions exploited in the last quarter of this game.
        Use records of exploited actions to generate a new distribution π_Obs (i.e. π_Obs(v) = fraction of the
            time v was exploited in the records).
    until stratDiff(s, s_old) < τ
    return s.

To address these difficulties, we developed the Cultaptation Strategy Learning Algorithm (CSLA), which
uses a method for creating a strategy and a distribution π_Obs simultaneously so that (i) π_Obs is the distribution
created when all agents in a Cultaptation game play the computed strategy and (ii) the computed strategy is
a near-best response for π_Obs (and other parameters).

This algorithm copes with the lack of information about π_Obs, and generates an approximation of a
strategy that is a best response to itself. At a high level, the algorithm can be thought of as generating a
series of strategies, each an ε-best response to the one before it, and stopping when two successive strategies
are extremely similar. A more detailed description of this process follows.

The algorithm begins by assuming π_Obs = π_Inv. The algorithm then proceeds iteratively; at each iteration
it generates s, the ε-best response strategy to the current π_Obs, then simulates a series of Cultaptation games
in which s plays itself, and extracts a new π_Obs from the actions exploited in these games.

At the end of each iteration, the algorithm compares s to s_old, the strategy produced by the previous
iteration, using the stratDiff function. stratDiff(s, s_old) computes the probability that an agent using
s would perform at least one different action before dying than the same agent using s_old. For instance,
stratDiff(s, s_old) = 1.0 means that the two strategies will always perform at least one different action (i.e., the
actions they choose on the first round are different), while stratDiff(s, s_old) = 0.0 means that s is identical to
s_old.
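To make the two boundary cases of stratDiff concrete, the following is a minimal sketch under a strong simplifying assumption that is ours, not the paper's: each strategy is just a list of actions indexed by the agent's age, so the first disagreement happens at a fixed age. Real Cultaptation strategies are conditioned on the agent's history, so the actual computation has to account for every history the agent may encounter. The function name and the `death_prob` parameter are hypothetical.

```python
def strat_diff(s, s_old, death_prob):
    """Probability that an agent lives long enough to reach the first age
    at which two age-indexed strategies choose different actions.

    Assumption (illustrative only): a strategy is a list of actions indexed
    by age, and the agent survives each round with probability 1 - death_prob.
    """
    for age, (action, old_action) in enumerate(zip(s, s_old)):
        if action != old_action:
            # The agent always acts at age 0 and must survive `age` rounds
            # to reach the first differing decision.
            return (1.0 - death_prob) ** age
    return 0.0  # no disagreement: the strategies are identical


same = strat_diff(["observe", "exploit"], ["observe", "exploit"], 0.25)
first_round = strat_diff(["innovate", "exploit"], ["observe", "exploit"], 0.25)
third_round = strat_diff(["observe", "innovate", "exploit"],
                         ["observe", "innovate", "observe"], 0.25)
```

Under this toy model, a disagreement on the first round gives 1.0 and identical strategies give 0.0, matching the two cases described in the text; later disagreements are discounted by the chance of surviving that long.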

When stratDiff(s, s_old) is found to be below some threshold τ, CSLA terminates and returns s, the strategy
computed by the last iteration. The formal algorithm is presented as Algorithm 2.
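The iteration described above can be sketched as a fixed-point loop. This is only an illustration of the control flow: `best_response`, `simulate`, and `strat_diff` are caller-supplied placeholders standing in for Strat, the game simulator, and stratDiff, the `max_iters` cap is an added safety measure, and the toy stand-ins at the bottom (a strategy reduced to a single threshold) are hypothetical and far simpler than anything used in the experiments.

```python
from collections import Counter


def empirical_distribution(exploited_values):
    """Step 8: pi_obs(v) = fraction of the records in which v was exploited."""
    if not exploited_values:
        raise ValueError("no exploited actions were recorded")
    counts = Counter(exploited_values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def csla(pi_inv, tau, best_response, simulate, strat_diff, max_iters=100):
    """Alternate best response and simulation until successive strategies agree."""
    pi_obs = dict(pi_inv)                         # step 1
    s = None                                      # step 2
    for _ in range(max_iters):                    # step 3 (bounded here)
        s_old = s                                 # step 4
        s = best_response(pi_inv, pi_obs)         # steps 5-6
        records = simulate(s, pi_inv)             # step 7
        pi_obs = empirical_distribution(records)  # step 8
        # Step 9; the first pass has no previous strategy to compare against.
        if s_old is not None and strat_diff(s, s_old) < tau:
            break
    return s


# Toy stand-ins, purely illustrative: a "strategy" is one threshold, and
# simulated agents exploit every action value at or above that threshold.
def toy_best_response(pi_inv, pi_obs):
    return sum(value * prob for value, prob in pi_obs.items())


def toy_simulate(threshold, pi_inv):
    return [value for value in pi_inv if value >= threshold]


def toy_strat_diff(s, s_old):
    return 0.0 if s == s_old else 1.0


pi_inv = {20: 0.25, 40: 0.25, 80: 0.25, 160: 0.25}
fixed_point = csla(pi_inv, 0.5, toy_best_response, toy_simulate, toy_strat_diff)
```

In the toy run the threshold climbs from the mean of π_Inv until the observed distribution stops changing, at which point two successive strategies coincide and the loop exits. Passing the simulator in as a parameter also mirrors the idea of training against a chosen population rather than only against copies of s.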

Properties of the strategy. In our experimental studies (see Section 9), the strategies produced by CSLA
in any given game were all virtually identical, even when a random distribution (rather than π_Inv) was used
to initialize π_Obs. This strongly suggests (though it does not prove) that the strategy profile consisting of
copies of s_self is a symmetric near-Nash equilibrium.

Furthermore, there is reason to believe that s is evolutionarily stable. Consider an environment in which
all agents use the strategy s, and suppose a small number (say, one or two) of other strategies are introduced
as invaders. Because s was a near-best response to the environment that existed before the invading
agents were introduced, and because the introduction of one or two invaders will change this environment
only slightly, agents using s will still be using a strategy that is close to the best response for the current
environment, and they will also have some payoff they have accumulated on previous rounds when
their strategy was still a near-best response. Thus, the invaders should have a difficult time establishing a
foothold in the population, and hence should die out with high probability. This suggests (but does not prove)
that s is evolutionarily stable.9

8.1  Implementation  Details 

We  have  created  a  Java  implementation  of  CSLA.  Here  we  briefly  discuss  two  issues  we  dealt  with  during 
implementation. 

8.1.1 Representing π_Obs

Our implementation of CSLA uses the same representation of π_Obs as our implementation of Algorithm 1
does. In other words, it assumes π_Obs has the same form as π_Inv, and remains constant throughout the game.
Ideally we would be able to condition π_Obs on the agent's history, but in practice this would require too
much space (since there are an exponential number of possible histories), and we would need to run too
many simulations in step 7 to get an accurate distribution for each history.

8.1.2  Training 

In  the  Machine  Learning  literature,  the  process  of  improving  an  agent’s  performance  on  a  given  task  is  often 
referred  to  as  “training.”  In  Algorithm  2,  strategy  s  is  trained  by  playing  against  itself  in  a  series  of  simulated 
games  in  step  7.  However,  in  our  implementation  of  CSLA  we  have  left  the  agents  involved  in  the  games 
in  step  7  as  a  parameter  to  the  algorithm.  This  means  that  CSLA  can  also  produce  a  strategy  that  is  trained 
by playing in an environment consisting of itself and one or more given strategies. The intuition behind this
approach is that a strategy trained by playing against itself and strategy s' may perform better when playing
against s' than a strategy trained against itself alone. We test this hypothesis experimentally in Section 9.2.

9  Experimental  Results 

In  this  section  we  present  our  experimental  results. 

Section 9.1.1 examines the performance of our implementation of the ε-best response algorithm. We
find that our optimizations allow the algorithm to find strategies within 1% of the best response about 1,000
times faster than the unoptimized algorithm. Section 9.1.2 examines the strategies found by the ε-best response
algorithm when presented with different environments and strategy profiles, and the results give us an idea
of what kinds of circumstances are necessary for the near-best-response strategy to prefer innovation over
observation.

Section  9.2  presents  a  series  of  experiments  comparing  two  strategies  generated  with  our  Cultaptation 
Strategy  Learning  Algorithm  to  a  known  good  strategy  used  in  the  international  Cultaptation  tournament. 
We find that the strategies generated with CSLA are able to beat the known good strategy, even when the
environment is different from the one CSLA used to learn the strategies (Sections 9.2.2 and 9.2.3). Finally,
we  perform  an  in-depth  qualitative  analysis  of  all  three  strategies  and  highlight  the  differences  in  behavior 
that  give  our  learned  strategies  an  advantage  (Section  9.2.4). 

9 Among other things, a formal proof would require a way to calculate the payoffs for s and any invading strategy. Accomplishing
this is likely to be complicated, but we hope to do it in our future research.

Table 4: Expected per-round utility of the ε-best response strategy computed by Algorithm 4, for eight
different values of ε in various environments.

9.1 Experiments with ε-Best Response Algorithm

In this section we present the experiments involving our implementation of Algorithm 4, which generates
ε-best-response strategies for a given set of game parameters and strategy profile.

9.1.1  Implementation  Performance 

Our first set of experiments was designed to study the accuracy and running time of our implementation, and
the effectiveness of the methods we have developed to improve its performance. We first examined the effect
of ε on running time and on the expected per-round utility computed by Algorithm 4. We ran the experiments
in several different environments. First we examined the uniform1 environment. In this environment, π_Inv
is a uniform distribution over the values in {33.33, 66.67, 100, 133.33, 166.67}, and S contains the
innovate-once (II) strategy from Example 1, so π_Inv is identical to π_Obs. The probability of change in uniform1 is
1%, and the probability of death is 40%.

We also introduced several variations on the uniform1 environment to study the effect of different
probabilities of change. They are uniform10, uniform20, uniform30, and uniform40, which have probabilities
of change of 10%, 20%, 30%, and 40%, respectively.

In Table 4, we see the EPRU computed for various values of ε in these environments. As a point
of reference, strategy II can be analytically shown to achieve an EPRU of about 38.56. We can see that
an upper bound on achievable EPRU in the uniform1 environment is 40.5, since the EPRU of an ε-best
response to S is 40.1 when ε is 0.4. Also, we note that the algorithm finds lower EPRUs as the probability
of change increases. This is as expected: in a rapidly changing environment, one cannot expect an agent
to do as well as in a static environment where good actions remain good and bad actions remain bad. The
ε-best-response strategies computed generally innovate as the first action, then exploit that value if it is not
the lowest value available (in this case 33.33). Otherwise, the strategies tend to innovate again in an attempt
to find an action with a value bigger than 33.33. This is how they manage to achieve a higher EPRU than
the innovate-once strategy.
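The upper bound of 40.5 follows directly from what it means to be an ε-best response, read additively as in the text. Writing s* for a true best response to S and s_ε for the computed ε-best response (both symbols are introduced here only for illustration), the bound can be written out as:

```latex
\[
\mathrm{EPRU}(s^{*}) \;\le\; \mathrm{EPRU}(s_{\varepsilon}) + \varepsilon
  \;=\; 40.1 + 0.4 \;=\; 40.5 .
\]
```

Since 0.4/40.5 is just under 1%, this setting of ε corresponds to a strategy within roughly 1% of the best response in the uniform1 environment.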

As part of the experiment in the uniform1 environment, we kept track of the number of nodes searched
by four variations of the algorithm. In the first variation, we ran Strat(R, r, k, V, S) (Algorithm 4) without
optimizations. We also examined the algorithm's performance with the pruning and caching optimizations
described in Section 7.5.3, applied individually and together.

In Figure 2 we see that employing both caching and pruning allows us to compute strategies within 1%
of the best response about 1,000 times faster. We note that the search time required for an 80,000-node search
is around 15 seconds on a 3.4GHz Xeon processor.

Figure 2: Number of nodes searched in the uniform1 environment, with different combinations of caching
and pruning.

9.1.2 Effects of Varying π_Inv and π_Obs on the ε-Best-Response

The objective of this experiment was to study how near-best-response strategies (as computed by Algorithm 4)
change as we vary the mean and standard deviation of π_Inv and π_Obs (which we will call μ_Inv, σ_Inv,
μ_Obs, and σ_Obs, respectively). If we assume that the other agents in the game are rational and not trying to
deceive us by intentionally exploiting low-utility actions, then one should expect that μ_Obs ≥ μ_Inv. It may
seem natural, then, to conclude that an agent should choose to observe rather than innovate whenever possible,
since the average action returned by observing will have higher utility than one returned by innovating.
However, previous work has suggested that the standard deviation of these distributions may also play a role
in determining which is better [32]. Also, as discussed in Section 3.1, it is possible to imagine pathological
scenarios where a population that relies too heavily on observation can become stuck exploiting a low-value
action. We designed this experiment to test the hypothesis that, even if we let μ_Obs > μ_Inv, we can still vary
the standard deviations of these distributions such that the ε-best-response strategy computed by EPRU^k_alt
will choose to innovate rather than observe. Our methods and results are presented below.

We used the repertoire-based algorithm Strat(R, r, k, V) (Algorithm 4) to compute ε-best-response strategies
for Cultaptation games with several different parameter settings, then analyzed the strategies to determine
how often they would observe or innovate. In this experiment, the agents died with 40% probability
on each round (d = 0.4) and there were 5 potential exploitation actions. These games are smaller than the
Cultaptation game used in the tournament, to ensure they can be solved in a reasonable amount of time. In
our games, we used distributions π_Obs and π_Inv with means of 110 and 100, respectively.

Table 5 shows the results for four combinations of parameter settings: σ_Inv ∈ {10, 300} and σ_Obs ∈
{10, 300}. When σ_Inv = 10, the near-best-response strategy will observe almost exclusively (innovating only
in rare cases where observation returns several low-quality moves in a row). However, in the environment
where σ_Inv = 300 the near-best-response strategy includes significantly more innovation actions; when σ_Obs = 10 it
will innovate 99.5% of the time, and even when σ_Obs = 300 it still innovates 21.5% of the time.

This experiment lets us conclude that the means of π_Inv and π_Obs are not sufficient to determine whether
innovation or observation is better. In particular, if the standard deviation of innovated values is high, then
innovation becomes more valuable because multiple innovations tend to result in a higher-valued action than
multiple observations.
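The effect of the standard deviation can be illustrated with a small Monte Carlo estimate of the best value obtained from several draws. This sketch assumes normally distributed action values, which is our assumption for illustration only (the text specifies just the means and standard deviations, and the games in the experiment have only 5 exploitation actions); `mean_best_of` and its parameters are hypothetical names.

```python
import random


def mean_best_of(n_draws, mu, sigma, trials=20000, seed=0):
    """Estimate the expected maximum of n_draws samples from Normal(mu, sigma)."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(mu, sigma) for _ in range(n_draws))
    return total / trials


# Lower mean but high spread (innovation) versus higher mean but low spread
# (observation), using the means and standard deviations from the text.
best_of_three_innovations = mean_best_of(3, mu=100, sigma=300)
best_of_three_observations = mean_best_of(3, mu=110, sigma=10)
single_observation = mean_best_of(1, mu=110, sigma=10)
```

With one draw the estimate simply recovers the mean, so observation looks better; with several draws the high-variance source yields a much larger expected best value, which is the intuition behind the result above.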

An interesting strategy emerges when π_Inv and π_Obs both have high standard deviations. Even though
the mean value of innovated actions is lower than the mean value of observed actions, the ε-best-response
strategy in these cases innovates initially, then, if the value innovated is high, exploits that value. If the
innovated value is not high, an observation action is performed to ensure the agent has a reasonably-valued
action available to exploit until it dies.

Table 5: The portion of innovation actions (calculated as n_Inv/(n_Inv + n_Obs)) in the ε-best-response strategy
when the standard deviations of π_Inv and π_Obs are as specified. In all cases, μ_Inv = 100 and μ_Obs = 110.

9.2  Experiments  with  the  Cultaptation  Strategy  Learning  Algorithm 

The  objective  of  our  second  experiment  was  to  examine  the  performance  of  strategies  produced  by  the 
Cultaptation  Strategy  Learning  Algorithm  (Algorithm  2  in  Section  8),  and  the  importance  of  the  environment 
(see Section 8.1.2) used to train these strategies. Specifically, we were interested in:

•  examining  whether  the  strategies  produced  with  CSLA  were  capable  of  beating  a  strategy  that  is 
known  to  do  well; 

•  examining  whether  strategies  produced  by  CSLA  were  able  to  perform  well  in  environments  different 
from  those  they  were  trained  in; 

•  comparing how well a strategy that is trained only against itself (i.e., all agents in the simulated game
in Step 7 of the CSLA algorithm use strategy s) can do at repelling an invader, versus how well a
strategy trained against the invader (i.e., the invading strategy is included in the population of agents
at Step 7) can do at repelling the invader.

For the previous experiments, we assumed we had an oracle for π_Obs. For the rest of this section we will
be running experimental simulations, so our oracle will observe what the agents do in the simulations and
construct π_Obs from this, as described in Section 8.

For the known good strategy we used an algorithm called EVChooser, which performs a few innovation
and observation actions early in the game and uses the results of these actions (along with a discount
factor) to estimate the expected value of innovating, observing, and exploiting, choosing the action with the
highest expected value. It placed 15th out of over 100 entries in the Cultaptation tournament [5]. We chose
EVChooser because (1) it has been shown to be a competitive strategy, (2) since we had written it, its source
code was readily available to us (unlike the other successful strategies from the Cultaptation tournament),
and (3) it could be tuned to perform well in the Cultaptation environments we used (which, in order to
accommodate CSLA's exponential running time, were much smaller than those used in the international
Cultaptation tournament).

For games as small as the ones in our experiments, we believe EVChooser is representative of most of
the high-performing strategies from the tournament. Nearly all of the strategies described in the tournament
report [5] spend some time trying to figure out what the innovate and observe distributions look like, and
afterwards use some heuristic for choosing whether to innovate, observe, or exploit their best known action
on any given round. This heuristic often involves some type of expected-value computation; for instance,
the winning strategy discountmachine used a discount factor to compare the utility gained by exploiting the
current best-known action to the utility of possibly learning a better action and exploiting it on all future
rounds, which is exactly what EVChooser does.10 Unlike our CSLA algorithm, none of the strategies in the
tournament conducted lookahead search.
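The kind of discounted comparison described above can be sketched as follows. This is not the code of EVChooser or discountmachine, which is not reproduced here; it is a minimal illustration assuming that learning costs one round of payoff, that the agent then exploits the better of its old and new action on every later round, and that a single discount factor `gamma` (a hypothetical parameter) stands in for the chance of surviving to the next round. The distributions are assumed to be estimates held as value-to-probability dictionaries.

```python
def exploit_value(best, gamma):
    """Discounted utility of exploiting the current best action from now on."""
    return best / (1.0 - gamma)


def learn_value(best, dist, gamma):
    """Discounted utility of spending one round learning, then exploiting
    whichever is better: the current best action or the newly learned one."""
    expected_best = sum(prob * max(best, value) for value, prob in dist.items())
    return gamma * expected_best / (1.0 - gamma)


def choose_action(best, pi_inv_est, pi_obs_est, gamma):
    """Pick the action with the highest estimated discounted utility."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    options = {
        "exploit": exploit_value(best, gamma),
        "innovate": learn_value(best, pi_inv_est, gamma),
        "observe": learn_value(best, pi_obs_est, gamma),
    }
    return max(options, key=options.get)


uniform = {20: 0.25, 40: 0.25, 80: 0.25, 160: 0.25}
good_observations = {80: 0.5, 160: 0.5}
poor_observations = {20: 1.0}

with_best_action = choose_action(160, uniform, good_observations, 0.75)
with_worst_action = choose_action(20, uniform, good_observations, 0.75)
when_nobody_innovates = choose_action(20, uniform, poor_observations, 0.75)
```

A one-step comparison of this kind never looks further ahead than the next learning action, which is the main contrast with the lookahead search used by CSLA.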

For this experiment, we used an environment where π_Inv was a uniform distribution over the actions
{20, 40, 80, 160}, probability of change was 1%, and probability of death was 25%. Due to the exponential
running time of our strategy-generating algorithm, this is the largest environment (i.e., smallest probability
of death, highest number of actions and action values) for which we could compute full strategies.

9.2.1  Convergence  and  Consistency  of  CSLA 

We developed a Java implementation of Algorithm 2 that allows us to specify the type of game to be used
for the simulation in Step 7. We then created two strategies, s_self and s_EVC. The training process for both
strategies began with s_0, the best response to a random π_Obs distribution, and continued by constructing
a strategy s_{i+1} as a best response to the π_Obs generated by simulating games involving s_i. When training
s_self the simulated games consisted solely of agents using s_i, but while training s_EVC they consisted of a
population of agents using s_i being invaded by EVChooser. In both cases, 100 games were simulated at
each step of the iteration, to limit the amount of noise in the π_Obs that was extracted from the simulations.

While we have no theoretical guarantees that the strategies produced by Algorithm 2 will converge, the
algorithm's similarity to policy iteration [18] led us to suspect that they would converge. Also, since
CSLA is greedy, i.e., it selects the best response strategy at each step of the iteration, we were interested in
seeing whether the strategy it found represented a local maximum or a global one.

We designed a simple experiment to see how these issues would play out when generating s_self and s_EVC:
we modified the program to use a randomly-generated distribution for the initial value of π_Obs, rather than
always initially setting π_Obs = π_Inv as is done in Algorithm 2, and we used this modified program to generate
100 alternate versions of s_self and s_EVC. We then compared these alternates to the original s_self and s_EVC
using stratDiff. In the case of s_self, we found that all 100 alternate versions were identical to the original. In
the case of s_EVC, we found that 58 alternate versions were identical to the original, and the rest exhibited a
stratDiff of no more than 1.08 × 10^-4. This means that an agent using an alternate version of s_EVC would
choose all the same actions as one using the original s_EVC at least 99.989% of the time. This tells us that not
only does Algorithm 2 converge for the environment we are testing it in, it converges to the same strategy
each time it is run. This suggests that the algorithm is finding a globally-best solution for this environment,
rather than getting stuck in a local maximum.

Finally, to get an idea of how different s_self and s_EVC are, we calculated stratDiff(s_self, s_EVC) and found
it to be 0.27. This means that training a strategy against an external, fixed strategy in Algorithm 2 does
produce significantly different results than training a strategy against itself. For a more in-depth look at
where s_self and s_EVC differ, see Section 9.2.4.

9.2.2 Pairwise Competitions: s_self vs. EVChooser and s_EVC vs. EVChooser

We played both of our generated strategies, s_self and s_EVC, against EVChooser for 20,000 games: in 10,000
games, our strategy was defending against an invading population of EVChooser agents, and in 10,000
games the roles were reversed, with our strategy invading and EVChooser defending. We recorded the
population of each strategy on every round, as well as the winner of every game.11 The populations in an
individual game were extremely noisy, as seen in Figure 3(e); however, by averaging the populations over

10 discountmachine differs from EVChooser largely because it modifies the expected value of Observing using a machine-learned
function that accounts for observe actions being unreliable and returning multiple actions, neither of which is possible in our
version of the game.

11  Recall  that  the  winner  of  a  Cultaptation  game  is  the  strategy  with  the  highest  average  population  over  the  last  quarter  of  the 
game. 




[Figure 3 panels: a) EVChooser invading s_self; b) s_self invading EVChooser; c) EVChooser invading s_EVC;
d) s_EVC invading EVChooser; e) Population of s_self at each round, in a single game against EVChooser]

Figure 3: Average populations of both strategies for each round, in match-ups between s_self and EVChooser
(parts a and b) and between s_EVC and EVChooser (parts c and d), over 10,000 games. From round 2000
onwards, s_self or s_EVC controls 57% of the population on average, regardless of whether EVChooser was
invading or defending. Since mutation is enabled from round 100 onwards, populations in an individual
game (exhibited in part e) are highly mercurial and do not converge. Therefore, we must run a large number
of trials and average the results to get a good idea of each strategy's expected performance.

Table 6: Win percentages of s_self and s_EVC when playing against EVChooser over 10,000 games as both
Defender and Invader.

Win percentage    Defending vs. EVChooser    Invading vs. EVChooser
s_self            70.65%                     70.16%
s_EVC             69.92%                     69.92%

Table 7: Percentage of games won (out of 10,000) by s_self, s_EVC, and EVChooser in a melee contest between
all three.

                        s_self    s_EVC     EVChooser
Melee win percentage    38.78%    37.38%    23.84%

all 10,000 games we can see some trends emerge. These average populations for each strategy in all four
match-ups are presented in Figure 3(a-d), while the win rates for each match-up are presented in Table 6.

In Figure 3 we see that, on average, the strategies generated by Algorithm 2 control roughly 57% of the
population for the majority of the game in all four match-ups. Interestingly, both s_self and s_EVC are able to
reach this point in roughly the same amount of time whether they are invading or defending. It is also worth
noting that, even though we showed above that s_self and s_EVC have significant differences, they performed
almost identically against EVChooser in terms of population and win percentages.
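The win criterion recalled in footnote 11 (highest average population over the last quarter of the game) can be sketched as below. The data layout, one list of per-round population counts per strategy, and the function name are assumptions made for this illustration.

```python
def last_quarter_winner(populations):
    """Return the strategy with the highest mean population over the final
    quarter of the game.

    populations: dict mapping a strategy name to its per-round population.
    """
    lengths = {len(series) for series in populations.values()}
    if len(lengths) != 1:
        raise ValueError("every strategy needs one entry per round")
    n_rounds = lengths.pop()
    quarter = n_rounds // 4
    if quarter == 0:
        raise ValueError("need at least four rounds")
    averages = {
        name: sum(series[-quarter:]) / quarter
        for name, series in populations.items()
    }
    return max(averages, key=averages.get)


game = {
    "s_self": [10, 20, 55, 60, 50, 58, 57, 57],
    "EVChooser": [90, 80, 45, 40, 50, 42, 43, 43],
}
winner = last_quarter_winner(game)
```

Because only the final quarter counts, an invader that starts with a tiny population can still win a game, which is why the invading and defending win rates in Table 6 can be so similar.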

9.2.3 Melee Competition: s_self vs. s_EVC vs. EVChooser

Our next experiment was to run s_self, s_EVC, and EVChooser against one another in a melee contest to see
how the three strategies would interact in an environment where none of them originally had the upper hand.
All three strategies had an initial population of 33 agents at the start of each game. We used the same π_Inv,
probability of change, and probability of death as in Experiment 2. Mutation was disabled for the final 2,500
rounds of each melee game, as was done in the Cultaptation tournament to allow the population to settle.
We ran 10,000 games in this manner, and the percentage of wins for each strategy is shown in Table 7.

In the table we can see that s_self has a slight edge over s_EVC, and both these strategies have a significant
advantage over EVChooser. In fact, we observed that in the first 100 rounds of most games (before mutation
begins) EVChooser nearly died out completely, although it was able to gain a foothold once mutation
commenced. Mutation is also turned off after 7500 rounds in Cultaptation melee games; this caused the
population to quickly become dominated by one of the three strategies in all 10,000 games played.

9.2.4 Performance Analysis of s_self, s_EVC, and EVChooser

In the experiments in Section 9.2.2, we saw that the strategies found by CSLA consistently outperform
EVChooser in environments similar to the ones they were trained in. In order to get a better idea of why
this happens, we ran two experiments to compare the performance of s_self and EVChooser in more detail.
The first was designed to show us the kinds of situations in which the two strategies chose different actions,
while the second was designed to let us see how well the two strategies were able to spread good actions
through their population.

[Figure 4 panels: a) s_self innovates; b) s_EVC innovates; c) EVChooser innovates; d) s_self observes;
e) s_EVC observes; f) EVChooser observes; g) s_self exploits; h) s_EVC exploits; i) EVChooser exploits]

Figure 4: The observed probability that s_self, s_EVC, and EVChooser will innovate, observe, or exploit when
they are a given number of rounds old (on the x-axis) and with a given value of the best action in the agent's
repertoire. These results were observed by allowing each strategy to play itself for five games of 10,000
rounds each with 100 agents alive on each round, generating a total of 5,000,000 samples. All graphs in this
figure share the same legend, which is included in graph c) and omitted elsewhere to save space.

Action Preferences. The objective of this experiment was to identify the kinds of situations in which s_self,
s_EVC, and EVChooser made different choices. To this end, we allowed s_self to play against itself for five
games, in an environment identical to the one used for the previous experiments in Section 9.2 (note that
this is the same environment s_self was trained in). On each round, for each agent, we recorded the number
of rounds the agent had lived, the value of the best action in its repertoire,12 and whether the agent chose to
innovate, exploit, or observe on that round. Since there are 100 agents alive at any given time and each game
lasts 10,000 rounds, this gave us five million samples. Figures 4(a), (d), and (g) show the observed probability
that s_self would innovate, observe, or exploit (respectively) for its first ten rounds and for each possible best
action value. We then repeated this process for s_EVC and EVChooser, allowing each strategy to play against
itself for five games and recording the same data. The results for s_EVC and EVChooser may be found in
Figures 4(b), (e), and (h), and Figures 4(c), (f), and (i), respectively.
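The tabulation behind these probabilities can be sketched as a simple count over the recorded samples. The tuple layout (age, best known value, chosen action) and the function name are assumptions for this illustration; the sample data below is made up and is not taken from the experiments.

```python
from collections import Counter, defaultdict


def action_probabilities(samples):
    """Estimate P(action | age, best value) from recorded decisions.

    samples: iterable of (age, best_value, action) tuples, one per agent per
    round; best_value is None if the agent has not yet learned an action.
    """
    counts = defaultdict(Counter)
    for age, best_value, action in samples:
        counts[(age, best_value)][action] += 1
    probabilities = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        probabilities[key] = {action: n / total for action, n in counter.items()}
    return probabilities


# Made-up records purely to show the shape of the input and output.
records = [
    (1, None, "observe"),
    (2, 20, "observe"),
    (2, 20, "innovate"),
    (2, 20, "observe"),
    (2, 160, "exploit"),
]
table = action_probabilities(records)
```

With five games of 10,000 rounds and 100 agents alive per round, the input to such a tabulation would contain 5 × 10,000 × 100 = 5,000,000 records per strategy.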

The most obvious difference among the three strategies is that EVChooser almost never innovates,13 a
property it shares with the strategies that did well in the Cultaptation tournament [34]. On the other hand,
s_self and s_EVC have conditions under which they innovate and conditions under which they do not. For
instance, both s_self and s_EVC always innovate if their first action (which is always an observation) returns no
action. Also, s_EVC frequently innovates if it is stuck with the worst action after several observes, and s_self
also innovates (although less frequently; see next paragraph) in this case. Another sharp contrast between
EVChooser and the generated strategies is in their exploitation actions. EVChooser spends nearly all of its
time exploiting, even if it has a low-value action, and only observes with significant probability on round
two. On the other hand, s_self and s_EVC will begin exploiting immediately if they have one of the two best
actions, but otherwise will spend several rounds observing or innovating to attempt to find a better one, and
the number of rounds they spend searching for a better action increases as the quality of their best known
action decreases.

The main difference between s_self and s_EVC that can be seen in Figure 4 is in the way they handle being
stuck with the lowest-value action after several rounds. In these circumstances, s_self prefers observation
while s_EVC prefers innovation. Here we see the most obvious impact of the differing environments used to
generate these two strategies. s_self prefers observation in these cases because it was trained in an environment
where all agents are willing to perform innovation. Therefore, if an s_self agent is stuck with a bad action for
more than a few rounds it will continue to observe other agents, since if a better action exists, it is likely
that it has already been innovated by another agent and is spreading through the population. On the other
hand, s_EVC prefers innovation in these situations because it has been trained with EVChooser occupying a
significant portion of the population, and we have seen that EVChooser almost never innovates. Therefore,
if s_EVC is stuck with a bad action after several rounds, it will attempt to innovate to find a better one, since
it is less likely that another agent has already done so.

Spreading High-value Actions The objective of this experiment was to measure the rate at which $s_{\text{self}}$,
$s_{\text{EVC}}$, and EVChooser were able to spread high-valued actions through their populations. To measure this,
we again played $s_{\text{self}}$ against itself in the same environment used in the previous experiment (which we will
refer to as the “normal” environment in this section), and on each round we recorded the number of agents
exploiting actions with each of the four possible values (20, 40, 80, and 160). To account for the noise
introduced by changing action values, we ran 10,000 games and averaged the results for each round. We
then repeated this process, playing $s_{\text{EVC}}$ and EVChooser against themselves. The results for $s_{\text{self}}$, $s_{\text{EVC}}$, and
EVChooser may be found in Figures 5(a), (c), and (e) respectively.

This  experiment  lets  us  see  what  the  steady  state  for  these  strategies  looks  like,  and  how  quickly  they 
are  able  to  reach  it.  However,  we  are  also  interested  in  seeing  how  they  respond  to  structural  shocks  [10,  14] 
(i.e.  how  quickly  the  strategies  are  able  to  recover  when  a  good,  widely-used  action  changes  values).  To  this 
end,  we  created  a  “shock”  environment,  which  is  identical  to  the  normal  environment  with  one  modification: 
actions with value 160 have a probability of change equal to 0 except on rounds divisible by 100, in which
case they have probability of change equal to 1. All other actions use the normal probability of change
for this environment, 0.01. This modification creates a shock every 100 rounds, while still keeping the
expected number of changes the same for all actions. We then repeated the experiment above with the shock
environment, running 10,000 games for $s_{\text{self}}$, $s_{\text{EVC}}$, and EVChooser and averaging the results, which are
presented in Figures 5(b), (d), and (f) respectively.

12 This could be 20, 40, 80, 160, or None if the agent had not yet discovered an action.

13 EVChooser innovates 1% of the time on its first round.

[Figure 5 panels, each plotted against the round number: (a) $s_{\text{self}}$ in normal environment; (b) $s_{\text{self}}$ in shock environment; (c) $s_{\text{EVC}}$ in normal environment; (d) $s_{\text{EVC}}$ in shock environment; (e) EVChooser in normal environment; (f) EVChooser in shock environment.]

Figure 5: The average number of agents exploiting an action with value U in two environments. The
“normal” environment in parts a, c, and e shows how quickly $s_{\text{self}}$, $s_{\text{EVC}}$, and EVChooser spread actions
through their population under normal circumstances when they control the entire population. The “shock”
environment in parts b, d, and f shows how quickly each strategy responds to periodic structural shock.
The “normal” environment is the same as in the rest of Section 9.2, and the “shock” environment is similar
except that actions with value 160 are forced to change every 100th round and held constant all other rounds.
Each data point is an average over 10,000 games.
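The change schedule of the two environments can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `change_probability` and `expected_changes` are hypothetical and are not taken from the authors' simulator; the constants (0.01, 160, 100) are the ones quoted in the text.

```python
def change_probability(value, round_number, shock=False,
                       base_probability=0.01, shock_value=160, period=100):
    """Probability that an action with this value changes on this round.

    Hypothetical helper: in the shock environment, actions with the best
    value change only (and always) on rounds divisible by `period`.
    """
    if shock and value == shock_value:
        return 1.0 if round_number % period == 0 else 0.0
    return base_probability


def expected_changes(value, shock, first_round=1, last_round=100):
    """Expected number of changes over an inclusive window of rounds."""
    return sum(change_probability(value, r, shock)
               for r in range(first_round, last_round + 1))
```

Summing the per-round probabilities over any window of 100 consecutive rounds gives an expected count of one change in both environments, which is the sense in which the shock environment keeps the expected number of changes the same.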

In Figure 5 we can see that $s_{\text{self}}$ and $s_{\text{EVC}}$ exhibit nearly identical performance in both the normal and
shock environments. In the normal environment, they are able to reach their steady state in only a few
rounds, and the steady state consists of a roughly equal number of agents exploiting the best and second-best
action. In the shock environment, we see that $s_{\text{self}}$ and $s_{\text{EVC}}$ respond to external shock by drastically
increasing the number of agents exploiting the second-best action over the course of a few rounds, and
returning to their steady states at a roughly linear rate over the next 100 rounds. The number of $s_{\text{self}}$ and
$s_{\text{EVC}}$ agents exploiting the two worst actions remains extremely low except for small spikes immediately
after each shock.

Compared to the generated strategies, EVChooser’s performance appears to be less stable, and less robust
to structural shock. In the normal environment, we see that EVChooser takes hundreds of rounds to
reach its steady state. While EVChooser’s steady state does include more agents exploiting the best action
than $s_{\text{self}}$ and $s_{\text{EVC}}$, it also includes a significant number of agents exploiting the two worst actions. In the
shock environment, we see that changes to the best action result in significant increases to the number of
EVChooser agents exploiting the other actions, including the two worst ones. We can also see that populations
of EVChooser agents take much longer to return to normal after an external shock than populations of
$s_{\text{self}}$ and $s_{\text{EVC}}$. These results help us account for the superior performance of $s_{\text{self}}$ and $s_{\text{EVC}}$ over EVChooser
in previous experiments, and indicate that there is plenty of room for improvement in EVChooser and strategies
like it.

10  Conclusion 

In this paper, we have obtained several results that we hope will help provide insight into the utility of inter-agent
communication in evolutionary environments. These results can be divided into two main classes:
algorithms  for  computing  strategies  and  equilibria  for  Cultaptation  and  similar  games,  and  properties  of 
Cultaptation  found  by  examining  the  strategies  generated  by  the  algorithms. 

10.1  Algorithms 

Generating near-best-response strategies. In Section 7 we described an algorithm that, for any set $S_{-\alpha}$
of available strategies other than our own, can construct a nonstationary strategy $s_{\text{opt}}$ that is a best response
to the other strategies in an infinite Cultaptation game. The algorithm performs a finite-horizon search,
and is generalizable to other evolutionary games in which there is a fixed upper bound on per-round utility
and a nonzero lower bound on the probability of death at each round. In Section 7.5 we described a
state-aggregation technique that speeds up the algorithm by an exponential amount; the state-aggregation
technique is generalizable to other evolutionary games in which the utilities are Markovian.

Even with this speedup, computing $s_{\text{opt}}$ is not feasible on large instances of Cultaptation, because it
requires too much time and space. But the same algorithm can be used to compute, for any $\epsilon > 0$, a much
simpler strategy that is an $\epsilon$-best response (i.e., its expected utility is within $\epsilon$ of that of $s_{\text{opt}}$). This computation
takes much less time and space.

Computing symmetric near-Nash equilibria. In Section 8 we introduced the Cultaptation Strategy
Learning Algorithm (CSLA), which runs the strategy-generation algorithm iteratively in order to compute a strategy
$s_{\text{self}}$ that is a symmetric near-Nash equilibrium. Our experiments in Section 9.2 show that CSLA converges
to $s_{\text{self}}$ in just a few iterations. We have argued that $s_{\text{self}}$ is likely to be evolutionarily stable, but have not yet
proved this. We hope to prove it in our future work.

10.2  Properties  of  Cultaptation 

The  strategies  generated  by  our  algorithm  show  a  clear  preference  for  Observe  actions  over  Innovate  actions. 
The  same  bias  appeared  in  the  Cultaptation  tournament,  where  the  best  strategies  relied  almost  exclusively 
on observation. This result was a surprise to tournament organizers Rendell et al., who expected to see
innovation performed more frequently among winning strategies [34].

Probably the reason for the tournament organizers’ surprise was that the existing literature on social learning
in evolutionary games has suggested that learning through observation might have only limited value
in  games  like  Cultaptation.  Reasoning  about  a  slightly  simpler  model,  Rogers  concluded  that  populations 
would  still  retain  a  significant  portion  of  individual  learners  even  if  social  learning  was  allowed  [35].  Boyd 
and  Richerson  extended  Rogers’  result,  claiming  that  social  learning  would  not  improve  the  average  fitness 
of  a  population  in  any  evolutionary  scenario,  as  long  as  the  only  benefit  to  social  learning  was  that  it  avoided 
the  costs  of  individual  learning  [4]. 

In contrast, we are able to show that $s_{\text{self}}$ is an effective strategy despite performing mostly Observe actions.
We can also provide specific details about when and in what conditions Innovate becomes preferable. Rather
than relying exclusively on Observe actions, $s_{\text{self}}$ will innovate if an agent is stuck with a low-valued action
after several attempted observations. This effect is even more prominent in $s_{\text{EVC}}$, which was far more likely
than $s_{\text{self}}$ to perform Innovate actions if the first few observations failed to produce a good value. Figure 4
shows a noticeable increase in the number of Innovate actions performed by $s_{\text{EVC}}$ when the agent still has
a low value by the third round of its life. Based on the data in Figure 5, we know that the quality of $\pi_{\text{Obs}}$
is not as good in an environment with EVChooser, so $s_{\text{EVC}}$ will resort to innovation both more quickly and
more often than $s_{\text{self}}$.

Rendell et al. attribute the effectiveness of Observe actions in the tournament to the “filtering” process
performed by other agents. Since most agents will act rationally by exploiting their best action, an Observe
action will quickly reveal high-valued actions to the observer. Our results show a similar phenomenon,
where $s_{\text{self}}$ is able to propagate high-valued actions more quickly not simply because it exploits its best
action, but also because it hesitates to exploit low-valued actions and spends that time learning instead.
By not polluting $\pi_{\text{Obs}}$ with low-valued exploits, the probability that a high-valued action will be observed
actually increases. The consequence of this is shown in Figure 5, where high-valued actions propagate very
quickly in a population of $s_{\text{self}}$ agents, and more slowly in a population of EVChooser agents. The graph in Figure 4
shows that EVChooser is far more likely than $s_{\text{self}}$ to exploit low-valued actions.

10.3  Future  Work 

One limitation of this work is that we were unable to compare $s_{\text{self}}$, the strategy generated by our CSLA,
against  the  best-performing  agents  in  the  Cultaptation  tournament.  EVChooser,  the  strategy  that  we  used 
as  an  invader  in  Section  9.2,  placed  15th  in  the  Cultaptation  tournament.  In  the  future,  we  hope  to  get  the 
source  code  for  the  top-performing  agents  and  test  our  algorithms  against  them. 

Also  left  for  future  work  is  the  examination  of  information-gathering  in  the  social  learning  game.  An 
agent  in  the  game  would  not  normally  be  able  to  compute  a  best-response  strategy  as  we  have  done  in  this 
paper,  because  it  would  not  know  the  other  players’  strategies,  nor  the  probability  distributions  from  which 
the  innovation  and  observation  utilities  are  drawn.  Such  an  agent  would  need  either  to  approximate  the 
distributions  or  to  use  an  algorithm  that  can  do  well  without  them.  If  we  choose  to  approximate,  should 
our agent be willing to sacrifice some utility early on, in order to gain information that will improve its
approximation? Are there strategies that perform well in a wide variety of environments, that we could use
until our agent develops a good approximation? Are some of these strategies so versatile that we can simply
use them without needing to know the distributions? These remain open questions.

A Proofs


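Proposition 2 $\mathrm{EPRU}(s \mid G, S)/d = \mathrm{EPRU}_{\mathrm{alt}}(s, \langle\rangle \mid G, S)$.

Proof.

First, we will show by induction that $\mathrm{EPRU}_{\mathrm{alt}}(s_\alpha, \langle\rangle \mid G, S)$ equals the summation of
$P(h_\alpha \mid s_\alpha, S)\,\mathrm{EV}_{\mathrm{exp}}(|h_\alpha|, U(h_\alpha[|h_\alpha|]))$ for all histories $h_\alpha$. Then we will show that this equals the summation of
$L(|h_\alpha|)\,P(h_\alpha \mid s_\alpha, S)\,\mathrm{PRU}(h_\alpha)$ for all $h_\alpha$.

We will begin with the definition of $\mathrm{EPRU}_{\mathrm{alt}}$ in Equation 13, and note that
$P(\langle\rangle \circ t \mid \langle\rangle, s_\alpha(\langle\rangle), S) = P(\langle\rangle \circ t \mid s_\alpha, S)$ for histories of length one. This gives us a base case of

\[
\mathrm{EPRU}_{\mathrm{alt}}(s_\alpha, \langle\rangle \mid G, S) = \sum_{t \in T} P(\langle\rangle \circ t \mid s_\alpha, S)\,\mathrm{EV}_{\mathrm{exp}}(1, U(t)) + \sum_{t \in T} P(\langle\rangle \circ t \mid s_\alpha, S)\,\mathrm{EPRU}_{\mathrm{alt}}(s_\alpha, \langle\rangle \circ t \mid G, S).
\]

For the inductive case, we will again start from Equation 13, this time noting that
$P(h_\alpha \mid s_\alpha, S)\,P(h_\alpha \circ t \mid h_\alpha, s_\alpha(h_\alpha), S)$ simplifies to just $P(h_\alpha \circ t \mid s_\alpha, S)$. Thus, for all $h_\alpha$ we can rewrite
$P(h_\alpha \mid s_\alpha, S)\,\mathrm{EPRU}_{\mathrm{alt}}(s_\alpha, h_\alpha \mid G, S)$ in terms of histories one round longer than $h_\alpha$, as follows:

\[
P(h_\alpha \mid s_\alpha, S)\,\mathrm{EPRU}_{\mathrm{alt}}(s_\alpha, h_\alpha \mid G, S) = \sum_{t \in T} P(h_\alpha \circ t \mid s_\alpha, S)\,\mathrm{EV}_{\mathrm{exp}}(|h_\alpha \circ t|, U(t)) + \sum_{t \in T} P(h_\alpha \circ t \mid s_\alpha, S)\,\mathrm{EPRU}_{\mathrm{alt}}(s_\alpha, h_\alpha \circ t \mid G, S).
\]

Therefore, by induction we have

\[
\mathrm{EPRU}_{\mathrm{alt}}(s_\alpha, \langle\rangle \mid G, S) = \sum_{h_\alpha \in H} P(h_\alpha \mid s_\alpha, S)\,\mathrm{EV}_{\mathrm{exp}}(|h_\alpha|, U(h_\alpha[|h_\alpha|])),
\]

where $H$ is the set of all possible histories.
The proof then proceeds arithmetically:

\begin{align*}
\mathrm{EPRU}_{\mathrm{alt}}(s_\alpha, \langle\rangle \mid G, S)
  &= \sum_{h_\alpha \in H} P(h_\alpha \mid s_\alpha, S)\,\mathrm{EV}_{\mathrm{exp}}(|h_\alpha|, U(h_\alpha[|h_\alpha|])) \\
  &= \sum_{h_\alpha \in H} P(h_\alpha \mid s_\alpha, S) \sum_{i=|h_\alpha|}^{\infty} \frac{L(i)\,U(h_\alpha[|h_\alpha|])}{i} \\
  &= \sum_{h_\alpha \in H} \sum_{i=|h_\alpha|}^{\infty} \frac{L(i)\,P(h_\alpha \mid s_\alpha, S)\,U(h_\alpha[|h_\alpha|])}{i} \\
  &= \sum_{i=1}^{\infty} \sum_{h_\alpha \in H(\le i)} \frac{L(i)\,P(h_\alpha \mid s_\alpha, S)\,U(h_\alpha[|h_\alpha|])}{i} \\
  &= \sum_{i=1}^{\infty} \sum_{h_\alpha \in H(i)} L(i)\,P(h_\alpha \mid s_\alpha, S)\,\mathrm{PRU}(h_\alpha) = \mathrm{EPRU}(s_\alpha \mid G, S)/d,
\end{align*}

where $H(\le i)$ is the set of all histories of length less than or equal to $i$, and $H(i)$ is the set of all histories
of length exactly $i$. □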
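The last step of the chain, which replaces the sum over $H(\le i)$ by a sum over $H(i)$, can be spelled out as follows. This is a sketch under two assumptions that are not restated in this appendix: that $\mathrm{EV}_{\mathrm{exp}}(r, v) = \sum_{i \ge r} L(i)\,v/i$, and that $\mathrm{PRU}(h)$ is the mean utility over the rounds of $h$. It also uses the fact that the probability of a history equals the total probability of its extensions of any fixed greater length.

```latex
% Assumption: PRU(h) = (1/|h|) \sum_{j=1}^{|h|} U(h[j]).
% Assumption: for |h| = j <= i, P(h | s, S) is the sum of P(h' | s, S)
%             over the length-i histories h' that extend h.
\begin{align*}
\sum_{h \in H(\le i)} P(h \mid s_\alpha, S)\, U(h[|h|])
  &= \sum_{j=1}^{i} \sum_{h \in H(j)} P(h \mid s_\alpha, S)\, U(h[j]) \\
  &= \sum_{j=1}^{i} \sum_{h' \in H(i)} P(h' \mid s_\alpha, S)\, U(h'[j]) \\
  &= \sum_{h' \in H(i)} P(h' \mid s_\alpha, S)\; i \cdot \mathrm{PRU}(h').
\end{align*}
```

Multiplying both sides by $L(i)/i$ and summing over $i$ then gives the final line of the proof.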

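Proposition 3 $\mathrm{Strat}(h_\alpha, k, V, S)$ returns $\{s_\alpha, U\}$ such that

\[
\mathrm{EPRU}^{k}_{\mathrm{alt}}(s_\alpha, h_\alpha \mid G, S) = U = \max_{s'} \left(\mathrm{EPRU}^{k}_{\mathrm{alt}}(s', h_\alpha \mid G, S)\right).
\]

Proof. Let $s'$ be a strategy maximizing $\mathrm{EPRU}^{k}_{\mathrm{alt}}(s', h_\alpha \mid G, S)$ and let $\{s_\alpha, U\}$ be the strategy and value
returned by $\mathrm{Strat}(h_\alpha, k, V, S)$. We will show by induction on $k$ that

\[
\mathrm{EPRU}^{k}_{\mathrm{alt}}(s_\alpha, h_\alpha \mid G, S) = U = \mathrm{EPRU}^{k}_{\mathrm{alt}}(s', h_\alpha \mid G, S).
\]

In the base case, $k = 0$ and clearly $\mathrm{EPRU}^{0}_{\mathrm{alt}}(s', h_\alpha \mid G, S) = 0$ for any $s'$, therefore
$\mathrm{EPRU}^{0}_{\mathrm{alt}}(s_\alpha, h_\alpha \mid G, S) = \mathrm{EPRU}^{0}_{\mathrm{alt}}(s', h_\alpha \mid G, S) = 0 = U$ as required.

For the inductive case, suppose that for $k$, $\mathrm{Strat}(h_\alpha, k, V, S)$ returns $\{s_\alpha, U\}$ such that
$\mathrm{EPRU}^{k}_{\mathrm{alt}}(s_\alpha, h_\alpha \mid G, S) = U = \max_{s'}(\mathrm{EPRU}^{k}_{\mathrm{alt}}(s', h_\alpha \mid G, S))$. We must then show that $\mathrm{Strat}(h_\alpha, k+1, V, S)$
returns $\{s_\alpha, U\}$ such that

\[
\mathrm{EPRU}^{k+1}_{\mathrm{alt}}(s_\alpha, h_\alpha \mid G, S) = U = \max_{s'} \left(\mathrm{EPRU}^{k+1}_{\mathrm{alt}}(s', h_\alpha \mid G, S)\right).
\]

Let $s_{\mathrm{temp}}$ be the strategy constructed in lines 8-19 of the algorithm. First we show that on line 20,

\[
\mathrm{EPRU}^{k+1}_{\mathrm{alt}}(s_{\mathrm{temp}}, h_\alpha \mid G, S) = \sum_{t \in T} P(h_\alpha \circ t \mid h_\alpha, s_{\mathrm{temp}}(h_\alpha), S) \left(\mathrm{EV}_{\mathrm{exp}}(|h_\alpha \circ t|, U(t)) + \mathrm{EPRU}^{k}_{\mathrm{alt}}(s_{\mathrm{temp}}, h_\alpha \circ t \mid G, S)\right) = U_{\mathrm{temp}}. \tag{20}
\]

This follows because the $t$ on line 11 iterates over all possible $t \in T$ (due to the for loops on lines 6, 9, and
10), meaning that the eventual value of $U_{\mathrm{temp}}$ is

\[
\sum_{t \in T} P(h_\alpha \circ t \mid h_\alpha, s_{\mathrm{temp}}(h_\alpha), S) \left(\mathrm{EV}_{\mathrm{exp}}(|h_\alpha \circ t|, U(t)) + U'\right).
\]

By the inductive hypothesis, $U' = \mathrm{EPRU}^{k}_{\mathrm{alt}}(s_{\mathrm{temp}}, h_\alpha \circ t \mid G, S)$, sufficing to show that (20) holds.

Now we show that

\[
\mathrm{EPRU}^{k+1}_{\mathrm{alt}}(s_\alpha, h_\alpha \mid G, S) = \mathrm{EPRU}^{k+1}_{\mathrm{alt}}(s', h_\alpha \mid G, S).
\]

Clearly $\mathrm{EPRU}^{k+1}_{\mathrm{alt}}(s_\alpha, h_\alpha \mid G, S) \le \mathrm{EPRU}^{k+1}_{\mathrm{alt}}(s', h_\alpha \mid G, S)$, since $s'$ is assumed to have maximal $\mathrm{EPRU}^{k+1}_{\mathrm{alt}}$ for
$h_\alpha$, so it suffices to show that

\[
\mathrm{EPRU}^{k+1}_{\mathrm{alt}}(s_\alpha, h_\alpha \mid G, S) \ge \mathrm{EPRU}^{k+1}_{\mathrm{alt}}(s', h_\alpha \mid G, S).
\]

Since $s_\alpha$ maximizes

\[
\sum_{t \in T} P(h_\alpha \circ t \mid h_\alpha, s_\alpha(h_\alpha), S) \left(\mathrm{EV}_{\mathrm{exp}}(|h_\alpha \circ t|, U(t)) + U'\right),
\]

where $U' \ge \mathrm{EPRU}^{k}_{\mathrm{alt}}(s', h_\alpha \circ t \mid G, S)$ by the inductive hypothesis, there can be no action $a$ such that

\[
\sum_{t \in T} P(h_\alpha \circ t \mid h_\alpha, a, S) \left(\mathrm{EV}_{\mathrm{exp}}(|h_\alpha \circ t|, U(t)) + \mathrm{EPRU}^{k}_{\mathrm{alt}}(s', h_\alpha \circ t \mid G, S)\right)
> \sum_{t \in T} P(h_\alpha \circ t \mid h_\alpha, s_\alpha(h_\alpha), S) \left(\mathrm{EV}_{\mathrm{exp}}(|h_\alpha \circ t|, U(t)) + U'\right).
\]

Therefore $\mathrm{EPRU}^{k+1}_{\mathrm{alt}}(s_\alpha, h_\alpha \mid G, S) \ge \mathrm{EPRU}^{k+1}_{\mathrm{alt}}(s', h_\alpha \mid G, S)$. This concludes the inductive argument.

Thus for all $k$, $\mathrm{EPRU}^{k}_{\mathrm{alt}}(s_\alpha, h_\alpha \mid G, S) = U = \max_{s'} \mathrm{EPRU}^{k}_{\mathrm{alt}}(s', h_\alpha \mid G, S)$. □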
Algorithm 3 Creates and returns a repertoire for history $h_\alpha$.
CreateRepertoire($h_\alpha = (a_1, (m_1, v_1)), \ldots, (a_r, (m_r, v_r))$)
  Let $p = \{m_i \mid i = 1, \ldots, r\}$.
  Let $R = \emptyset$ {$R$ will be the repertoire.}
  for $m \in p$ do
    Let $i = \max \{\, j \in \{1, \ldots, r\} : m_j = m \,\}$
    Add $\langle v_i, r - i \rangle$ to $R$.
  end for
  return $R$
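A minimal Python sketch of Algorithm 3 follows. It is illustrative rather than the authors' implementation: the name `create_repertoire` and the encoding of a history as a list of `(action, (m, v))` tuples are assumptions, and the line "Let i = max ..." is read as "the latest round that gave information about action m". Ages are computed as r - i, exactly as the pseudocode states.

```python
def create_repertoire(history):
    """Build a repertoire, a set of (value, age) pairs, from a history.

    `history` is assumed to be a list with one (action, (m, v)) entry per
    round, where m is the number of the action the round gave information
    about and v is the value that was reported for it.
    """
    r = len(history)
    latest = {}  # action number m -> (round index i, value v) of the latest entry about m
    for i, (_action, (m, v)) in enumerate(history, start=1):
        latest[m] = (i, v)  # later rounds overwrite earlier ones
    # Age is r - i, following the pseudocode literally.
    return frozenset((v, r - i) for i, v in latest.values())
```

For instance, a three-round history that innovates action 1, observes action 2, and then exploits action 1 keeps only the most recent information about each of the two actions.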


B  Converting  from  Histories  to  Repertoires 

In  this  appendix,  we  will  formally  define  a  repertoire,  explain  how  to  transform  histories  into  repertoires,  and 
show  that  the  number  of  possible  repertoires  is  substantially  smaller  than  the  number  of  possible  histories, 
while maintaining the property that any best-response action for a given repertoire is also a best response
for  any  history  associated  with  that  repertoire  (Theorem  5).  Finally,  we  will  present  a  modified  version  of 
Algorithm  1,  which  uses  repertoires  rather  than  histories,  and  we  will  show  how  this  simple  change  cuts  the 
branching  factor  of  the  algorithm  in  half  (Algorithm  4). 

B.0.1  Repertoire  Definition 

A repertoire tells the last value and age of each action an agent “knows,” where an action’s age is the number
of rounds that have passed since the agent last obtained information about it. Since at any given point in a
game, each known action has a unique age, we label exploitation actions by their value and age, leaving off
the action number. For example, if we discovered an action with value 4 last round and an action with value 26 three
rounds ago, then the repertoire will be $\{\langle 4, 1\rangle, \langle 26, 3\rangle\}$, where $\langle 4, 1\rangle$ denotes the existence of an action with
value 4 discovered 1 round ago, and $\langle 26, 3\rangle$ denotes the existence of an action with value 26 discovered 3
rounds ago. Formally, a repertoire is defined to be a set of pairs, where the first value in each pair represents
the knowledge of an action with the given value, while the second value in the pair represents the number of
rounds since that knowledge was last updated.

Definition. Let $v_1, \ldots, v_m \in V$ be action values and $\gamma_1, \ldots, \gamma_m \in \mathbb{Z}^+$ (the positive integers) be action ages.
A repertoire $R$ is a set of action value/action age pairs $R = \{\langle v_1, \gamma_1\rangle, \ldots, \langle v_m, \gamma_m\rangle\}$. We denote the set of all
repertoires as $\mathrm{Rep}$, and the set of all repertoires where all $\gamma_i < j$ as $\mathrm{Rep}_j$. □

$\mathrm{Rep}$ has unbounded size, but $\mathrm{Rep}_j$ has finite size. We show how to create a repertoire $R$ from a history
$h_\alpha$ using the CreateRepertoire function in Algorithm 3.

Repertoires change based on the action performed. For example, repertoire

\[ R = \{\langle 4, 1\rangle, \langle 26, 3\rangle\} \]

can change to repertoire

\[ R' = \{\langle 4, 2\rangle, \langle 26, 4\rangle, \langle 27, 1\rangle\} \]

after an innovation action where an action with value 27 is innovated. Notice that all actions in $R'$, apart
from the newly-innovated action with age 1, are one round older than they were in $R$. This aging process
occurs often enough for us to introduce a function which ages a repertoire $R = \{\langle v_i, \gamma_i\rangle\}$:

\[ \mathrm{age}(\{\langle v_i, \gamma_i\rangle\}) = \{\langle v_i, \gamma_i + 1\rangle\} \]

Finally, we will introduce two functions to represent the two ways our repertoire can change when we
perform an action.

The first, newaction, returns a repertoire with a new action added to it:

\[ \mathrm{newaction}(R, v) = \mathrm{age}(R) \cup \{\langle v, 1\rangle\} \]

The second, updaction, returns a repertoire with updated information on action $m$:

\[ \mathrm{updaction}(R, v, m) = \mathrm{age}(R \setminus \{\langle v_m, \gamma_m\rangle\}) \cup \{\langle v, 1\rangle\} \]
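The three repertoire operations translate directly into set manipulations. The sketch below is an illustration under one stated assumption: because a Python set has no positions, the action $m$ passed to updaction is identified here by its current (value, age) pair, `old_pair`, rather than by an index.

```python
def age(rep):
    """Return the repertoire with every known action one round older."""
    return frozenset((v, y + 1) for v, y in rep)


def newaction(rep, v):
    """Age the repertoire and add a newly learned action of value v with age 1."""
    return age(rep) | {(v, 1)}


def updaction(rep, v, old_pair):
    """Replace the known action `old_pair` by fresh information (v, 1),
    ageing every other action by one round."""
    return age(rep - {old_pair}) | {(v, 1)}
```

Applied to the example in the text, `newaction(frozenset({(4, 1), (26, 3)}), 27)` yields the repertoire with pairs (4, 2), (26, 4), and (27, 1).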

B.0.2  Transition  Probabilities 

We can now define the probability of transitioning between repertoires on round $r$. We will call the transition
probability functions $P^{\mathrm{Rep}}(R' \mid R, r, a, S)$ for $a \in \{\mathrm{Inv}, \mathrm{Obs}, X_i\}$. In general, these functions will mirror the
$P(h' \mid h, a, S)$ functions defined in Section 5, with some extra clauses added to ensure that if it is not possible
to go from repertoire $R$ to repertoire $R'$ using the given action, then the transition probability is 0.


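Innovation actions For innovation actions, the function is:

\[
P^{\mathrm{Rep}}(R' \mid R, r, \mathrm{Inv}, S) =
\begin{cases}
0 & \text{if } |R| = \mu \;\lor\; \neg\exists v : R' = \mathrm{newaction}(R, v) \\
\pi_{\mathrm{Inv}}(v \mid r) & \text{if } R' = \mathrm{newaction}(R, v)
\end{cases}
\]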

The  first  clause  ensures  that  if  all  possible  actions  are  already  in  R,  or  if  it  is  not  possible  to  go  from 
R  to  R'  in  one  innovation  action,  then  the  transition  probability  is  0.  The  second  clause  simply  tells  us  the 
probability  of  innovating  an  action  with  value  v,  given  that  it  is  possible  to  go  from  R  to  R'  in  one  innovation 
action. 
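The two clauses for innovation can be sketched as a small function. This is illustrative only: `p_rep_inv`, the callable `pi_inv(v, r)` standing in for $\pi_{\mathrm{Inv}}(v \mid r)$, and the explicit `values` and `mu` arguments are assumptions made so the block is self-contained; repertoires are frozensets of (value, age) pairs.

```python
def age(rep):
    return frozenset((v, y + 1) for v, y in rep)


def newaction(rep, v):
    return age(rep) | {(v, 1)}


def p_rep_inv(rep_next, rep, r, pi_inv, values, mu):
    """Transition probability from `rep` to `rep_next` under an innovation.

    pi_inv(v, r) is a caller-supplied stand-in for the innovation
    distribution; mu is the total number of actions in the game.
    """
    if len(rep) == mu:
        return 0.0  # first clause: every action is already known
    for v in values:
        if rep_next == newaction(rep, v):
            return pi_inv(v, r)  # second clause
    return 0.0  # first clause: rep_next is not reachable by one innovation
```

At most one value can match, since the only pair with age 1 in `rep_next` determines the innovated value.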


Observation actions In Section 5 we assumed the existence of a distribution $\pi_{\mathrm{Obs}}$ that, when given the
current history, would tell us the probability of observing an action with a given value. Here, we will make
the following assumptions about $\pi_{\mathrm{Obs}}$:

• When given a repertoire and round number, $\pi_{\mathrm{Obs}}(v \mid R, r, S)$ tells us the probability of observing an
action that has value $v$ and is not already known by $R$.

• When given a repertoire and round number, $\pi_{\mathrm{Obs}}(m, v \mid R, r, S)$ tells us the probability of observing an
action that has value $v$, and was previously in $R$ at position $m$. The value of this action may have
changed.

• $\pi_{\mathrm{Obs}}$ can make its predictions without using any information lost when converting from a history to a
repertoire.

These assumptions are all satisfied by the $\pi_{\mathrm{Obs}}$ used in our implementation, and we expect them to hold
for other practical implementations as well, since a distribution conditioned on entire histories would be
impractically large.

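With this in mind, the transition probability function for observation actions is

\[
P^{\mathrm{Rep}}(R' \mid R, r, \mathrm{Obs}, S) =
\begin{cases}
\pi_{\mathrm{Obs}}(m, v \mid R, r, S) & \text{if } \langle v_m, \gamma_m\rangle \in R \;\land\; R' = \mathrm{updaction}(R, v, m) \\
\pi_{\mathrm{Obs}}(v \mid R, r, S) & \text{if } R' = \mathrm{newaction}(R, v) \\
0 & \text{otherwise.}
\end{cases}
\]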


The  first  clause  gives  us  the  probability  of  observing  an  action  already  in  our  repertoire,  while  the  second 
gives  us  the  probability  of  observing  a  new  action. 
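The observation case can be sketched in the same style. As before this is a hedged illustration: `p_rep_obs` and the two callables `pi_obs_known(pair, v, rep, r)` and `pi_obs_new(v, rep, r)` are hypothetical stand-ins for the two forms of $\pi_{\mathrm{Obs}}$, and a known action is identified by its (value, age) pair instead of a position $m$.

```python
def age(rep):
    return frozenset((v, y + 1) for v, y in rep)


def newaction(rep, v):
    return age(rep) | {(v, 1)}


def updaction(rep, v, old_pair):
    return age(rep - {old_pair}) | {(v, 1)}


def p_rep_obs(rep_next, rep, r, pi_obs_known, pi_obs_new, values):
    """Transition probability from `rep` to `rep_next` under an observation."""
    for v in values:
        for pair in rep:
            if rep_next == updaction(rep, v, pair):
                # first clause: we observed an action already in the repertoire
                return pi_obs_known(pair, v, rep, r)
        if rep_next == newaction(rep, v):
            # second clause: we observed an action we did not know
            return pi_obs_new(v, rep, r)
    return 0.0  # third clause: no single observation leads to rep_next
```

The first two clauses cannot both apply to the same pair of repertoires, because updaction preserves the size of the repertoire while newaction increases it by one.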
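Exploitation actions Let $\langle v_i, \gamma_i\rangle$ be the value and age of exploitation action $X_i$. Then

\[
P^{\mathrm{Rep}}(R' \mid R, r, X_i, S) =
\begin{cases}
0 & \text{if } |R| \ne |R'| \\
0 & \text{if } \forall v' \in V,\; R' \ne \mathrm{updaction}(R, v', i) \\
\displaystyle \prod_{j=r-\gamma_i}^{r} (1 - c(j)) + \sum_{j=r-\gamma_i}^{r} c(j)\,\pi(v_i, j) \prod_{n=j}^{r} (1 - c(n)) & \text{if } R' = \mathrm{updaction}(R, v_i, i) \\
\displaystyle \sum_{j=r-\gamma_i}^{r} c(j)\,\pi(v'_i, j) \prod_{n=j}^{r} (1 - c(n)) & \text{if } R' = \mathrm{updaction}(R, v'_i, i) \text{ and } v_i \ne v'_i
\end{cases}
\]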

The  first  two  clauses  check  that  we  can,  in  fact,  transition  between  R  and  R'  by  exploiting.  The  third 
clause  gives  us  the  probability  that  the  action  we  exploited  has  not  changed  since  we  last  saw  it,  while  the 
fourth  clause  gives  us  the  probability  that  the  action  we  exploited  has  changed. 


B.0.3 Consistency between $P$ and $P^{\mathrm{Rep}}$

Later in this section, we will show that using a repertoire-based algorithm to compute $\epsilon$-best-response strategies
returns the same results as using the history-based Algorithm 1. To do this, we will use the notion of
consistency between the $P$ and $P^{\mathrm{Rep}}$ equations.

Definition. Let $M$ be the set of actions known to an agent with history $h_\alpha$. The $P$ and $P^{\mathrm{Rep}}$ equations are
consistent for $h_\alpha$ if, for all $a \in \{\mathrm{Inv}, \mathrm{Obs}, X_i\}$ and $v \in V$:

\[
\sum_{m \in M} P(h_\alpha \circ (a, (m, v)) \mid h_\alpha, a, S) = \sum_{m=1}^{|R|} P^{\mathrm{Rep}}(\mathrm{updaction}(R, v, m) \mid R, r, a, S) \tag{21}
\]

and

\[
\sum_{m \in \{1, \ldots, \mu\} \setminus M} P(h_\alpha \circ (a, (m, v)) \mid h_\alpha, a, S) = P^{\mathrm{Rep}}(\mathrm{newaction}(R, v) \mid R, r, a, S) \tag{22}
\]

where $R = \mathrm{CreateRepertoire}(h_\alpha)$ and $r = |h_\alpha|$. □

Lemma 6 The $P$ and $P^{\mathrm{Rep}}$ equations are consistent for all $h_\alpha \in H$.

Proof. We can prove this by using the definition of $P$, found in Section 5, and the definition of $P^{\mathrm{Rep}}$
found above. We will simply consider what happens for arbitrary $h_\alpha$ and $v$ when performing innovation,
observation, and exploitation actions.

Recall that $X(h)$ returns the number of exploit moves available to an agent with history $h$. For ease
of exposition, we will assume without loss of generality that the first action learned by $h_\alpha$ has label 1, the
second has label 2, etc. Thus $M = \{1, \ldots, X(h)\}$, while $\{1, \ldots, \mu\} \setminus M = \{X(h) + 1, \ldots, \mu\}$.


Innovation actions An innovation action always returns information on a new action, so both sides of
Equation 21 are clearly 0 in this case. If $X(h) = |R| = \mu$, both sides of Equation 22 are also 0 since no new
actions can be innovated. Thus, we will assume $X(h) = |R| < \mu$. We now have

\[
\sum_{m=X(h)+1}^{\mu} P(h \circ (\mathrm{Inv}, (m, v)) \mid h, \mathrm{Inv}, S) = (\mu - X(h))\,\frac{\pi_{\mathrm{Inv}}(v \mid r)}{\mu - X(h)} = \pi_{\mathrm{Inv}}(v \mid r)
\]

and

\[
P^{\mathrm{Rep}}(\mathrm{newaction}(R, v) \mid R, r, \mathrm{Inv}, S) = \pi_{\mathrm{Inv}}(v \mid r)
\]

which are clearly equivalent. Hence, $P$ and $P^{\mathrm{Rep}}$ are consistent on $h_\alpha$ when $a = \mathrm{Inv}$.

Observation actions This section of the proof is mostly trivial, given the assumptions we have made about
$\pi_{\mathrm{Obs}}$. The left side of Equation 21 is

\[
\sum_{m=1}^{X(h)} P(h \circ (\mathrm{Obs}, (m, v)) \mid h, \mathrm{Obs}, S) = \sum_{m=1}^{X(h)} \pi_{\mathrm{Obs}}(m, v \mid h, S)
\]

which tells us the probability of observing one of the actions already seen in our current history. The right
side is

\[
\sum_{m=1}^{|R|} P^{\mathrm{Rep}}(\mathrm{updaction}(R, v, m) \mid R, r, \mathrm{Obs}, S) = \sum_{m=1}^{|R|} \pi_{\mathrm{Obs}}(m, v \mid R, r, S).
\]

Since we assume that $\pi_{\mathrm{Obs}}(m, v \mid R, r, S)$ tells us the probability of observing action $m$ in the repertoire,
$\sum_{m=1}^{|R|} \pi_{\mathrm{Obs}}(m, v \mid R, r, S)$ also tells us the probability of observing any of the actions we have already seen.
Therefore, Equation 21 holds for observation actions.

For Equation 22, the left side is

\[
\sum_{m=X(h)+1}^{\mu} P(h \circ (\mathrm{Obs}, (m, v)) \mid h, \mathrm{Obs}, S) = \sum_{m=X(h)+1}^{\mu} \pi_{\mathrm{Obs}}(m, v \mid h, S)
\]

which tells us the probability of observing an action we have not yet seen. The right side is

\[
P^{\mathrm{Rep}}(\mathrm{newaction}(R, v) \mid R, r, \mathrm{Obs}, S) = \pi_{\mathrm{Obs}}(v \mid R, r, S).
\]

Since we assume that $\pi_{\mathrm{Obs}}(v \mid R, r, S)$ gives us the probability of observing a new action, Equation 22 also
holds for observation actions. Hence, $P$ and $P^{\mathrm{Rep}}$ are consistent on $h_\alpha$ when $a = \mathrm{Obs}$.


Exploitation actions Exploiting an action never gives us information about a new action, so both sides of
Equation 22 are 0 when we exploit. Thus, we need only consider Equation 21.

We will consider two cases. In the first case, the action we choose to exploit has changed since we last
saw it, so $v$ is a new value. We then have

\[
\sum_{m=1}^{X(h)} P(h \circ (X_m, (m, v)) \mid h, X_m, S) = \sum_{m=1}^{X(h)} \sum_{j=\mathrm{last}_m}^{r} c(j)\,\pi(v, j) \prod_{i=j}^{r} (1 - c(i))
\]

where $\mathrm{last}_m$ is the last round number on which we obtained any information about the $m$-th action.
Similarly, since exploiting never increases the size of a repertoire, we have

\[
\sum_{m=1}^{|R|} P^{\mathrm{Rep}}(\mathrm{updaction}(R, v, m) \mid R, r, X_m, S) = \sum_{m=1}^{|R|} \sum_{j=r-\gamma_m}^{r} c(j)\,\pi(v, j) \prod_{i=j}^{r} (1 - c(i)).
\]

Since $|R| = X(h)$ and $\mathrm{last}_m = r - \gamma_m$ by definition, Equation 21 holds when the action we exploit changes.
Next, we consider the case where the action we choose to exploit has not changed. In this case, we have

\[
\sum_{m=1}^{X(h)} P(h \circ (X_m, (m, v)) \mid h, X_m, S) = \sum_{m=1}^{X(h)} \left( \prod_{j=\mathrm{last}_m}^{r} (1 - c(j)) + \sum_{j=\mathrm{last}_m}^{r} c(j)\,\pi(v, j) \prod_{i=j}^{r} (1 - c(i)) \right)
\]

and

\[
\sum_{m=1}^{|R|} P^{\mathrm{Rep}}(\mathrm{updaction}(R, v, m) \mid R, r, X_m, S) = \sum_{m=1}^{|R|} \left( \prod_{j=r-\gamma_m}^{r} (1 - c(j)) + \sum_{j=r-\gamma_m}^{r} c(j)\,\pi(v, j) \prod_{i=j}^{r} (1 - c(i)) \right)
\]

which are also equivalent. Therefore, Equation 21 also holds when the action we exploit does not
change.

We have now shown that $P$ and $P^{\mathrm{Rep}}$ are consistent on arbitrary $h_\alpha$ when $a \in \{\mathrm{Inv}, \mathrm{Obs}, X_i\}$ and for
arbitrary $v$. Therefore, $P$ and $P^{\mathrm{Rep}}$ are consistent for all $h_\alpha$. □

B.0.4  Repertoire-based  Strategies 

Repertoires can be used to more compactly define a strategy. We let a repertoire-based strategy $s$ be a
function from repertoires to actions. Such a strategy can be represented more compactly than the history-based
strategies used earlier in this paper, since there are fewer possible repertoires than there are possible
histories. In any history $h_\alpha$, a repertoire-based strategy $s$ chooses the action associated with repertoire
$\mathrm{CreateRepertoire}(h_\alpha)$.
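The relationship between a repertoire-based strategy and the history-based strategy it induces can be sketched as a simple wrapper. All names below (`create_repertoire`, `lift_to_histories`, `threshold_strategy`) and the encoding of histories as lists of `(action, (m, v))` tuples are assumptions made for illustration; the toy threshold rule is not one of the strategies studied in the paper.

```python
def create_repertoire(history):
    """Set of (value, age) pairs for the latest information on each action."""
    r = len(history)
    latest = {}
    for i, (_action, (m, v)) in enumerate(history, start=1):
        latest[m] = (i, v)
    return frozenset((v, r - i) for i, v in latest.values())


def lift_to_histories(repertoire_strategy):
    """Turn a strategy on repertoires into a strategy on histories by
    first converting the history to its repertoire."""
    def history_strategy(history):
        return repertoire_strategy(create_repertoire(history))
    return history_strategy


def threshold_strategy(rep):
    """Toy repertoire-based strategy: exploit once a value of 80 or more is known."""
    best = max((v for v, _age in rep), default=0)
    return "Exploit" if best >= 80 else "Observe"
```

Two histories that map to the same repertoire necessarily receive the same action from the lifted strategy, which is why a table indexed by repertoires can be much smaller than one indexed by histories.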

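We can use the $P^{\mathrm{Rep}}$ functions to define a formula that determines the EPRU for any repertoire-based
strategy $s$. $\mathrm{EPRU}_{\mathrm{alt}}(s, R, r \mid G, S)$ is a recursive function for calculating the expected per-round utility of $s$:

\begin{align*}
\mathrm{EPRU}_{\mathrm{alt}}(s, R, r \mid G, S) = \sum_{a \in A} \sum_{v \in V} \Big[ \; & P^{\mathrm{Rep}}(\mathrm{newaction}(R, v) \mid R, r, a, S) \times \\
& \left(\mathrm{EV}_{\mathrm{exp}}(r, U((a, (-, v)))) + \mathrm{EPRU}_{\mathrm{alt}}(s, \mathrm{newaction}(R, v), r + 1 \mid G, S)\right) \\
& + \sum_{m=1}^{|R|} P^{\mathrm{Rep}}(\mathrm{updaction}(R, v, m) \mid R, r, a, S) \times \\
& \left(\mathrm{EV}_{\mathrm{exp}}(r, U((a, (-, v)))) + \mathrm{EPRU}_{\mathrm{alt}}(s, \mathrm{updaction}(R, v, m), r + 1 \mid G, S)\right) \Big]
\end{align*}

where $A$ is the set of possible actions and $V$ is the set of possible action values.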

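However, like $EPRU_{alt}(s, h \mid G, S)$, $EPRU_{alt}(s, R, r \mid G, S)$ contains infinite recursion and is therefore not computable. We will deal with this problem as we did in Section 6.1, by introducing a depth-limited version. For ease of exposition we will introduce two "helper" functions

$$\begin{aligned}
EPRU^{k}_{alt\text{-}new}(s, R, r, a, v \mid G, S) = \; & P^{Rep}(\mathrm{newaction}(R, v) \mid R, r, a, S) \times \\
& \big[ EV_{exp}(r, U((a, (-, v)))) + EPRU^{k-1}_{alt}(s, \mathrm{newaction}(R, v), r + 1 \mid G, S) \big]
\end{aligned}$$

and

$$\begin{aligned}
EPRU^{k}_{alt\text{-}upd}(s, R, r, a, v \mid G, S) = \sum_{m=1}^{|R|} \; & P^{Rep}(\mathrm{updaction}(R, v, m) \mid R, r, a, S) \times \\
& \big[ EV_{exp}(r, U((a, (-, v)))) + EPRU^{k-1}_{alt}(s, \mathrm{updaction}(R, v, m), r + 1 \mid G, S) \big]
\end{aligned}$$

Now we can define

$$EPRU^{k}_{alt}(s, R, r \mid G, S) =
\begin{cases}
0, & \text{if } k = 0, \\
\displaystyle \sum_{a \in A} \sum_{v \in V} \big( EPRU^{k}_{alt\text{-}new}(s, R, r, a, v \mid G, S) + EPRU^{k}_{alt\text{-}upd}(s, R, r, a, v \mid G, S) \big), & \text{otherwise.}
\end{cases}$$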
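The depth-limited recursion can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: repertoires are encoded as sorted tuples of (value, age) pairs, `p_rep` is a caller-supplied stand-in for $P^{Rep}$ (with the strategy $s$ and the environment assumed to be folded into it), and `gain(r, a, v)` is a hypothetical stand-in for $EV_{exp}(r, U((a, (-, v))))$.

```python
from typing import Callable, Sequence, Tuple

Repertoire = Tuple[Tuple[float, int], ...]  # assumed: sorted (value, age) pairs


def age(rep: Repertoire) -> Repertoire:
    """Make every stored value one round older."""
    return tuple((v, g + 1) for v, g in rep)


def newaction(rep: Repertoire, v: float) -> Repertoire:
    """Repertoire after learning value v for a previously unknown action."""
    return tuple(sorted(age(rep) + ((v, 1),)))


def updaction(rep: Repertoire, v: float, m: int) -> Repertoire:
    """Repertoire after replacing the m-th entry (1-based) with fresh value v."""
    rest = rep[: m - 1] + rep[m:]
    return tuple(sorted(age(rest) + ((v, 1),)))


def epru_alt_k(
    rep: Repertoire,
    r: int,
    k: int,
    actions: Sequence[str],
    values: Sequence[float],
    p_rep: Callable[[Repertoire, Repertoire, int, str], float],
    gain: Callable[[int, str, float], float],
) -> float:
    """Depth-limited expected per-round utility; returns 0 at depth k = 0."""
    if k == 0:
        return 0.0
    total = 0.0
    for a in actions:
        for v in values:
            g = gain(r, a, v)
            # One "new action" successor plus one "update" successor per entry.
            successors = [newaction(rep, v)]
            successors += [updaction(rep, v, m) for m in range(1, len(rep) + 1)]
            for nxt in successors:
                p = p_rep(nxt, rep, r, a)
                if p > 0.0:  # zero-probability branches need no recursion
                    total += p * (g + epru_alt_k(nxt, r + 1, k - 1,
                                                 actions, values, p_rep, gain))
    return total


# Toy model (purely illustrative): every round reveals a new action worth 1.
def toy_p_rep(nxt: Repertoire, rep: Repertoire, r: int, a: str) -> float:
    return 1.0 if nxt == newaction(rep, 1.0) else 0.0


def toy_gain(r: int, a: str, v: float) -> float:
    return 1.0


print(epru_alt_k((), 1, 3, ["Inv"], [1.0], toy_p_rep, toy_gain))
```

In the toy model each level of recursion contributes exactly one unit of utility, so the result equals the search depth; the point of the example is only that the recursion terminates once `k` reaches zero.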


A proof that this formulation is equivalent to the version of $EPRU^{k}_{alt}$ from Section 6.1 follows.

Theorem 5 For all histories $h_\alpha$, all repertoire-based strategies $s$, and all $k \ge 0$, if $s'$ is a function from histories to actions where $s'(h) = s(\mathrm{CreateRepertoire}(h))$ and $r = |h_\alpha|$, then $EPRU^{k}_{alt}(s, R, r \mid G, S) = EPRU^{k}_{alt}(s', h \mid G, S)$.

Proof. We can prove this by using induction on $k$. For our base case, we will use $k = 0$, since $EPRU^{0}_{alt}(s, R, r \mid G, S) = EPRU^{0}_{alt}(s', h \mid G, S) = 0$ by definition.

For the inductive step, we will assume that Theorem 5 holds for some $k \ge 0$, and show that it also holds for $k + 1$. Recall from Section 6.1 that, in this case

$$EPRU^{k+1}_{alt}(s', h \mid G, S) = \sum_{t \in T} P(h \circ t \mid h, s(h), S) \big( EV_{exp}(r, U(t)) + EPRU^{k}_{alt}(s', h \circ t \mid G, S) \big)$$

Since  T  is  simply  the  set  of  all  action-percept  pairs,  we  can  instead  write  this  as 

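$$\begin{aligned}
EPRU^{k+1}_{alt}(s', h \mid G, S) = \sum_{a \in A} \sum_{v \in V} \sum_{m=1}^{\mu} \big[ \; & P(h \circ (a, (m, v)) \mid h, s(h), S) \\
& \big( EV_{exp}(r, U((a, (m, v)))) + EPRU^{k}_{alt}(s', h \circ (a, (m, v)) \mid G, S) \big) \big]
\end{aligned}$$

where $A$ is the set of possible actions and $V$ the set of possible values. We also have$^{14}$

$$\begin{aligned}
EPRU^{k+1}_{alt}(s, R, r \mid G, S) = \sum_{a \in A} \sum_{v \in V} \Big[ \; & P^{Rep}(\mathrm{newaction}(R, v) \mid R, r, a, S) \\
& \big( EV_{exp}(r, U((a, (-, v)))) + EPRU^{k}_{alt}(s, \mathrm{newaction}(R, v), r + 1 \mid G, S) \big) \\
& + \sum_{m=1}^{|R|} P^{Rep}(\mathrm{updaction}(R, v, m) \mid R, r, a, S) \\
& \big( EV_{exp}(r, U((a, (-, v)))) + EPRU^{k}_{alt}(s, \mathrm{updaction}(R, v, m), r + 1 \mid G, S) \big) \Big]
\end{aligned}$$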

Recall from Lemma 6 that $P$ and $P^{Rep}$ are consistent on $h_\alpha$. We can combine this with our inductive hypothesis to show that the bracketed portions of the two equations above are equal.

Recall that when a repertoire encounters a new action, it does not store the action number $m$ for that action. Thus, for any pair $m$ and $m'$ that are not already in $h_\alpha$, we know that $\mathrm{CreateRepertoire}(h \circ (a, (m, v))) = \mathrm{CreateRepertoire}(h \circ (a, (m', v))) = \mathrm{newaction}(R, v)$. Therefore, by our inductive hypothesis

$$EPRU^{k}_{alt}(s', h \circ (a, (m, v)) \mid G, S) = EPRU^{k}_{alt}(s, \mathrm{newaction}(R, v), r + 1 \mid G, S)$$

for all $m$ not already in $h_\alpha$. Notice that, by definition, there are $\mu - X(h)$ values of $m$ that are not already in $h_\alpha$. If we assume without loss of generality that the first action learned in $h_\alpha$ has label 1, the second has label 2, etc., then we can define $\beta_{new}$ to be the quantity $P$ and $P^{Rep}$ are multiplied by when we learn something new:

$$\begin{aligned}
\beta_{new} &= EV_{exp}(r, U((a, (-, v)))) + EPRU^{k}_{alt}(s', h \circ (a, (X(h) + 1, v)) \mid G, S) \\
&= EV_{exp}(r, U((a, (-, v)))) + EPRU^{k}_{alt}(s', h \circ (a, (X(h) + 2, v)) \mid G, S) \\
&\;\;\vdots \\
&= EV_{exp}(r, U((a, (-, v)))) + EPRU^{k}_{alt}(s', h \circ (a, (\mu, v)) \mid G, S) \\
&= EV_{exp}(r, U((a, (-, v)))) + EPRU^{k}_{alt}(s, \mathrm{newaction}(R, v), r + 1 \mid G, S)
\end{aligned}$$

14 Recall that the function $U$ simply calculates the utility of performing the given action, and does not depend on the action number. Thus $U(a, m, v) = U(a, v)$ for any legal $m$.
Similarly, the inductive hypothesis also tells us that

$$EPRU^{k}_{alt}(s', h \circ (a, (m, v)) \mid G, S) = EPRU^{k}_{alt}(s, \mathrm{updaction}(R, v, m), r + 1 \mid G, S)$$

for all $m$ that are already in $h_\alpha$. Thus, we can also define $\beta_m$ to be the quantity $P$ and $P^{Rep}$ are multiplied by when we update our information on action $m$:

$$\begin{aligned}
\beta_m &= EV_{exp}(r, U((a, (-, v)))) + EPRU^{k}_{alt}(s', h \circ (a, (m, v)) \mid G, S) \\
&= EV_{exp}(r, U((a, (-, v)))) + EPRU^{k}_{alt}(s, \mathrm{updaction}(R, v, m), r + 1 \mid G, S)
\end{aligned}$$

for $m = 1, \ldots, X(h)$.

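We can now rewrite $EPRU^{k+1}_{alt}(s', h \mid G, S)$ and $EPRU^{k+1}_{alt}(s, R, r \mid G, S)$ as

$$EPRU^{k+1}_{alt}(s', h \mid G, S) = \sum_{a \in A} \sum_{v \in V} \Big[ \sum_{m=X(h)+1}^{\mu} P(h \circ (a, (m, v)) \mid h, s(h), S)\, \beta_{new} + \sum_{m=1}^{X(h)} P(h \circ (a, (m, v)) \mid h, s(h), S)\, \beta_m \Big]$$

and

$$EPRU^{k+1}_{alt}(s, R, r \mid G, S) = \sum_{a \in A} \sum_{v \in V} \Big[ P^{Rep}(\mathrm{newaction}(R, v) \mid R, r, a, S)\, \beta_{new} + \sum_{m=1}^{|R|} P^{Rep}(\mathrm{updaction}(R, v, m) \mid R, r, a, S)\, \beta_m \Big]$$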

Equations 21 and 22 tell us that regardless of the values of $a$ and $v$,

$$\sum_{m=X(h)+1}^{\mu} P(h \circ (a, (m, v)) \mid h, s(h), S) = P^{Rep}(\mathrm{newaction}(R, v) \mid R, r, a, S)$$

and

$$\sum_{m=1}^{X(h)} P(h \circ (a, (m, v)) \mid h, s(h), S) = \sum_{m=1}^{|R|} P^{Rep}(\mathrm{updaction}(R, v, m) \mid R, r, a, S)$$

Therefore, $EPRU^{k+1}_{alt}(s', h \mid G, S) = EPRU^{k+1}_{alt}(s, R, r \mid G, S)$. This completes the induction. □

B.0.5  Repertoire-Based  Algorithm 

Now that we have a formula for computing the EPRU of a repertoire-based strategy, and we know that using repertoires rather than histories to calculate EPRU gives us the same results, we can update our algorithm to use repertoires. The new algorithm will be almost identical to Algorithm 1, except that using repertoires rather than histories will allow us to reduce our number of recursive calls by half. Let $R'_{Obs}$ be the set of repertoires for which $P^{Rep}(R'_{Obs} \mid R, r, \mathrm{Obs}) > 0$, and define $R'_{Inv}$ and $R'_{X_i}$ similarly. Note that, if $R$ contains $m$ different actions,

$$R'_{Obs} \subseteq R'_{Inv} \cup \bigcup_{i=1}^{m} R'_{X_i} \qquad (23)$$
Algorithm 4 Produce strategy $s$ that maximizes $EPRU^{k}_{alt}(s, R \mid G, S)$, given initial repertoire $R$, and set of possible utility values $V$.

Strat($R$, $r$, $k$, $V$, $S$)

    if $k = 0$ then
        return 0
    end if
    Let $U_{max} = 0$
    Let $s_{max}$ = null
    Let $U_{Obs} = 0$
    for each action $a \in \{X_1, \ldots, X_m, \mathrm{Inv}\}$ do
        Let $U_{temp} = 0$
        Let $s_{temp}$ = null
        for each value $v \in V$ do
            Let $t = \langle v, 1 \rangle$
            if $\exists i : (a = X_i)$ then Let $R' = \mathrm{age}(R \setminus \{\langle v_i, \gamma_i \rangle\}) \cup t$ and $p = P^{Rep}(R' \mid R, r, X_i, S)$
            else Let $R' = \mathrm{age}(R) \cup t$ and $p = P^{Rep}(R' \mid R, r, \mathrm{Inv}, S)$
            Let $p_{Obs} = P^{Rep}(R' \mid R, r, \mathrm{Obs}, S)$
            if $p + p_{Obs} > 0$ then
                Let $\{S', U'\}$ = Strat($R'$, $r + 1$, $k - 1$, $V$, $S$)
                if $p_{Obs} > 0$ then $U_{Obs} = U_{Obs} + p_{Obs} \cdot U'$
                if $p > 0$ then
                    $s_{temp} = s_{temp} \cup S'$
                    if $\exists i : (a = X_i)$ then $U_{temp} = U_{temp} + p \cdot (EV_{exp}(r, v) + U')$
                    else $U_{temp} = U_{temp} + p \cdot U'$
                end if
            end if
        end for
        if $U_{temp} > U_{max}$ then
            $U_{max} = U_{temp}$
            $s_{max} = s_{temp} \cup \langle R, a \rangle$
        end if
    end for
    if $U_{Obs} > U_{max}$ then
        $U_{max} = U_{Obs}$
        $s_{max} = s_{temp} \cup \langle R, \mathrm{Obs} \rangle$
    end if
    return $\{s_{max}, U_{max}\}$


In other words, since repertoires do not need to remember what actions our agent performed, choosing $X_1$ produces the same repertoire as choosing Obs and observing action 1. Similarly, choosing Inv and Obs can also produce the same repertoires, if both actions happen to tell us about the same action. However, there is no action we could encounter through observing that we could not encounter through either innovating or exploiting. Therefore, if we save the results of the recursive calls to calculate the utility of Inv and $X_1, \ldots, X_m$, we can compute the utility of Obs without any additional recursion. This cuts the branching factor of our algorithm in half, from $(2m + 2)v$ to $(m + 1)v$, which reduces the size of the search tree by a factor of $2^k$ for search depth $k$, without any impact on accuracy. Algorithm 4 is the complete algorithm.
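The claimed saving follows from simple counting, which the short sketch below checks for one arbitrary choice of parameters. The function names are illustrative, and the model assumes a uniform search tree in which every node has the same branching factor.

```python
def leaves(branching: int, depth: int) -> int:
    """Number of leaves in a uniform tree with the given branching and depth."""
    return branching ** depth


def reduction_factor(m: int, v: int, k: int) -> int:
    """Ratio of tree sizes before and after reusing the Inv/X_i results for Obs.

    m: number of actions in the repertoire, v: number of possible values,
    k: search depth.
    """
    full = leaves((2 * m + 2) * v, k)   # separate recursion for Obs outcomes
    reduced = leaves((m + 1) * v, k)    # Obs computed from saved results
    return full // reduced


# With m = 3 known actions, v = 4 values and depth k = 5:
print(reduction_factor(3, 4, 5) == 2 ** 5)
```

Because $(2m + 2)v = 2(m + 1)v$, the ratio is $2^k$ regardless of the particular $m$ and $v$ chosen.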
C Number of Cultaptation Strategies

In Cultaptation as defined on the tournament web site, each game includes 10,000 rounds, 100 agents, 100 exploitation actions, and the actions Inv and Obs. Let $S$ be the set of all pure Cultaptation strategies, and $S'$ be the set of all strategies such that the first 100 moves are Inv, and all subsequent moves are exploitation actions. Then any lower bound on $|S'|$ is a loose lower bound on $|S|$.

Suppose an agent uses a strategy in $S'$. If it survives for the first 100 rounds of the game, it will learn values for all 100 of the exploitation actions. There are $100!$ different orders in which these actions may be learned, and for each action there are 100 possible values; hence there are $100^{100}$ possible combinations of values. Thus after 100 Inv moves, the number of possible histories is $100! \times 100^{100}$. All subsequent moves by the agent will be exploitations; and it is possible (though quite unlikely!) that the agent may live for the remaining 9,900 rounds of the game. Thus each of the above histories is the root of a game tree of height $2 \times 9{,}900$. In this game tree, each node of even depth is a choice node (each branch emanating from the node corresponds to one of the 100 possible exploitation actions), and each node of odd depth is a value node (each branch emanating from the node corresponds to one of the 100 different values that the chosen action may return). Since there are $100! \times 100^{100}$ of these game trees, the total number of choice nodes is

$$100! \times 100^{100} \sum_{d=0}^{9899} (100^2)^d > 9.3 \times 10^{39953}.$$

If we use the conventional game-theoretic definition that a pure strategy $s$ must include a choice of action at each choice node, regardless of whether the choice node is reachable given $s$, then it follows that

$$|S'| > 100^{9.3 \times 10^{39953}}.$$

If we use the definition used by game-tree-search researchers, in which a pure strategy only includes a choice of action at each choice node that is reachable given $s$, then the number of reachable choice nodes given $s$ is

$$100! \times 100^{100} \sum_{d=0}^{9899} 100^d > 9.4 \times 10^{20155},$$

so

$$|S'| > 100^{9.4 \times 10^{20155}}.$$
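The two node counts above can be checked with exact integer arithmetic. The sketch below is only a sanity check of the arithmetic: the function name `choice_nodes` is illustrative, and the geometric sums are evaluated in closed form. Comparisons are done on integers directly, since recent Python versions limit the length of integer-to-string conversions by default.

```python
import math


def choice_nodes(per_level: int, levels: int = 9900) -> int:
    """Count choice nodes over all game trees rooted after the 100 Inv moves.

    per_level is the growth factor between consecutive choice levels:
    100**2 when every action at every node is counted, 100 when only the
    nodes reachable under a fixed strategy are counted.
    """
    roots = math.factorial(100) * 100 ** 100
    # Closed form of sum_{d=0}^{levels-1} per_level**d (the division is exact).
    geometric_sum = (per_level ** levels - 1) // (per_level - 1)
    return roots * geometric_sum


all_nodes = choice_nodes(100 ** 2)
reachable_nodes = choice_nodes(100)

# Compare against 9.3 x 10^39953 and 9.4 x 10^20155, written as integers.
print(all_nodes > 93 * 10 ** 39952)
print(reachable_nodes > 94 * 10 ** 20154)
```

Both comparisons hold, so the stated lower bounds are consistent with the sums as written.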